The Realtime API
12 points by doener 2 days ago | 5 comments
  • M4v3R 2 days ago |
    We investigated speech-to-speech for a project a while back and estimated that building an end-to-end solution with the previous method would take us weeks at best for the MVP, because the pipeline was basically: speech -> Whisper STT model -> text, retrieval, API calls, etc. -> prompt -> LLM -> text -> TTS model -> speech. If this works as advertised, it could cut the amount of work required quite significantly. Excited to try it out (when it’s available in Europe, that is…).
    • michaelanckaert 2 days ago |
      For what it's worth, I built an MVP using that pipeline in about 3 days. I used the Azure AI Speech service and SDK. It worked pretty well despite the obviously long pipeline you described.
    • joshstrange 2 days ago |
      It’s not for a production-type thing but Home Assistant has this pipeline built in and you can swap out any of the 3 steps:

      * STT

      * LLM

      * TTS

      It’s pretty cool to be able to replace one of the parts, do some tests, then change another part.

      Again, it’s nothing you would use directly for a product, but it’s fairly easy to test your pipeline by swapping in different components. (Also, HA provides each component out of the box if you want it to handle STT/TTS and just test your LLM.)
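
      The swap-one-stage-at-a-time idea can be sketched roughly as below. This is not Home Assistant's actual API; `VoicePipeline` and the stand-in stage functions are hypothetical names, and each stage is just a plain callable so it can be replaced independently.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real engines would wrap network or model calls.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    """Three independent stages: audio -> text -> reply text -> audio."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def run(self, audio: bytes) -> bytes:
        text = self.stt(audio)      # speech to text
        reply = self.llm(text)      # text to reply text
        return self.tts(reply)      # reply text to speech


# Stand-in stages for testing; they only shuffle bytes and strings around.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def shout_llm(prompt: str) -> str:
    return prompt.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=echo_llm, tts=fake_tts)
print(pipeline.run(b"turn on the lights"))

# Swap only the LLM stage and rerun the same input.
pipeline.llm = shout_llm
print(pipeline.run(b"turn on the lights"))
```

      Keeping the stages as plain callables means a test run can replace one part while the other two stay fixed, which is the workflow described above.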

      • BrutalCoding 2 days ago |
        Add VAD to this list and it’s basically the same stack that I am running on mobile phones (on-device). It doesn’t beat OpenAI’s voice chat in terms of speed and intelligence, but it’s funny.

        The LLM part isn’t great, of course, due to the small model size. I'm still experimenting with different models and tweaks until I’m satisfied enough with the overall result on a recent-ish iPhone/Pixel.
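
        For readers unfamiliar with VAD (voice activity detection), here is a minimal sketch of the idea, assuming a simple per-frame energy threshold. The comment does not say which VAD is used, and on-device stacks often use a small model instead; `detect_speech`, the frame size, and the threshold are all illustrative choices.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    samples: Sequence[int],
    frame_size: int = 160,      # assumed: 10 ms frames at 16 kHz
    threshold: float = 500.0,   # assumed: tune per microphone and noise floor
) -> List[bool]:
    """Return one flag per frame: True where the frame's energy is high."""
    flags = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) >= threshold)
    return flags


# Synthetic input: silence, a 440 Hz tone standing in for speech, then silence.
silence = [0] * 320
tone = [int(3000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(320)]
flags = detect_speech(silence + tone + silence)
print(flags)
```

        In a pipeline like the one above, only the frames flagged True would be passed on to the STT stage.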

  • serf 2 days ago |
    It's frustrating that things like this get released by OpenAI, but one still cannot use voice in the web app, nor any of the advanced voice model stuff, without essentially emulating a phone.

    It's hard to know who OpenAI is working for -- is it a developer resource group or an actual customer-facing business? It feels like they don't know, either.