What latency do you have when running through a full conversational pipeline? Not just TTS

I need to understand the latency I should be able to achieve for an STT-LLM-TTS turn

Hey Mike,

The end-to-end latency profile (audio input → audio output) depends on a variety of factors, but you should be able to achieve something in the following range:

  • STT: ~100-300ms
  • LLM: 500ms-2s (depends on model)
  • TTS: ~200-400ms (streaming)

Total: ~1-3 seconds for a full conversational turn (roughly the sum of the three stages above). We recommend using streaming TTS to reduce perceived latency.
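
If it helps to see where that total comes from, here is a minimal Python sketch that adds up the per-stage ranges. It assumes the three stages run strictly one after another with no overlap, and the names (STAGES_MS, turn_latency_ms) are purely illustrative, not part of any SDK.

```python
# Per-stage latency ranges in milliseconds, copied from the estimates above.
STAGES_MS = {
    "stt": (100, 300),
    "llm": (500, 2000),
    "tts": (200, 400),
}


def turn_latency_ms(stages):
    """Return (best, worst) total latency, assuming stages run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = turn_latency_ms(STAGES_MS)
print(f"Full turn: {best} ms to {worst} ms")
```

The ranges sum to 800 ms at best and 2700 ms at worst, which rounds to the ~1-3 second figure. You can swap in your own measured numbers per stage to get a budget for your setup.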