Is there a difference between both models in this regard?
Hey Harpreet,
There is a difference. With inworld-tts-1, expect roughly 200-400 ms time-to-first-chunk in streaming mode. Total latency depends on text length, but this model is optimized for real-time applications. inworld-tts-max is slower but more expressive, so right now it’s not ideal for real-time use cases.
Pro tip: Always use streaming for interactive experiences. Perceived latency is much lower because playback can start as soon as the first chunk arrives, instead of waiting for the full audio.
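If you want to check time-to-first-chunk in your own setup, here's a minimal sketch of how you could measure it. It is not tied to our SDK: `measure_stream_latency` and `fake_tts_stream` are hypothetical names made up for this example, and the fake stream just stands in for whatever streaming call your client makes. The only assumption is that your client gives you an iterable of audio byte chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream_latency(
    open_stream: Callable[[], Iterable[bytes]],
) -> Tuple[float, float, int]:
    """Return (time_to_first_chunk_s, total_time_s, total_bytes).

    open_stream is called after the timer starts, so the measurement
    includes the time spent issuing the request.
    """
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0

    for chunk in open_stream():
        if first_chunk_at is None:
            # First audio has arrived: this is when playback could begin.
            first_chunk_at = time.perf_counter()
        total_bytes += len(chunk)

    end = time.perf_counter()
    if first_chunk_at is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_at - start, end - start, total_bytes


def fake_tts_stream(
    n_chunks: int = 5, delay_s: float = 0.01, chunk_size: int = 320
) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulate synthesis and network delay per chunk
        yield b"\x00" * chunk_size


if __name__ == "__main__":
    ttfc, total, size = measure_stream_latency(fake_tts_stream)
    print(f"first chunk: {ttfc * 1000:.0f} ms, "
          f"total: {total * 1000:.0f} ms, bytes: {size}")
```

To use it with a real client, pass a function that starts the streaming request and returns its chunk iterator in place of `fake_tts_stream`. Comparing the first number with the second shows how much sooner playback can start when streaming.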