I need to understand the latency I should be able to achieve for an STT-LLM-TTS turn
Hey Mike,
The end-to-end latency (audio input → audio output) depends on a variety of factors, but you should be able to achieve something in the following range:
- STT: ~100-300ms
- LLM: 500ms-2s (depends on model)
- TTS: ~200-400ms (streaming)
Total: ~1-3 seconds for a full conversational turn. We recommend streaming TTS to reduce perceived latency, since playback can begin before the full response has been synthesized.
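If it helps to sanity-check the total, here is a minimal sketch that adds up the per-stage ranges above. It assumes the three stages run strictly one after another and ignores network and audio-buffering overhead, so treat the result as a rough budget rather than a measurement. The names `STAGE_LATENCY_MS` and `turn_latency_ms` are just illustrative.

```python
# Per-stage latency ranges in milliseconds, copied from the estimates above.
STAGE_LATENCY_MS = {
    "stt": (100, 300),
    "llm": (500, 2000),
    "tts": (200, 400),
}


def turn_latency_ms(stages):
    """Return (best_case, worst_case) in ms, assuming stages run in sequence."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = turn_latency_ms(STAGE_LATENCY_MS)
print(f"Sequential turn latency: {best}-{worst} ms")
# prints "Sequential turn latency: 800-2700 ms"
```

That 800-2700 ms span is where the ~1-3 second figure comes from. The LLM stage dominates the worst case, so model choice is usually the first thing to look at if you need to get closer to the low end.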