I might be a bit biased (I did my PhD exploring how VUIs can persuade humans), but I think VUI is "the future" of computer interaction. If it's not the future, then at least it adds a new group of people (kids + elderly people) as potential users.
Additionally, I’ve been exploring an idea about voice interaction systems. Currently, most voice interactions are processed by converting voice input into text, generating a text-based response, and then turning this text back into audio. But what if we could train the system to respond directly in voice, without involving text at all? If developed to maturity, this model could produce responses that feel more natural and spontaneous, possibly diverging from traditional text-to-speech outputs. Natural speech has unique syntax and rhythm, not to mention dialect and tone variations, which could make a purely voice-trained system fascinating and more human-like.
Could you let me know if your current voice interaction model follows the standard speech-to-text-to-speech process, or if there is exploration in voice-to-voice processing?
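To make the contrast concrete, the two architectures reduce to something like this (all function names are made up, just to illustrate the distinction):

```python
# Conventional cascade: every turn round-trips through text.
def cascade_turn(user_audio):
    text_in = speech_to_text(user_audio)    # ASR
    text_out = text_llm(text_in)            # text-only reasoning
    return text_to_speech(text_out)         # TTS; prosody is reconstructed, not heard

# Direct voice-to-voice: one model maps audio to audio, so rhythm,
# dialect, and tone in the input can shape the response without
# ever being flattened into text.
def voice_to_voice_turn(user_audio):
    return audio_model(user_audio)
```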
It's interesting to think about what complete diversity (i.e., no tendency toward homogeneous conversation partners whatsoever in the training data) would yield, given that the model is just trying to produce whatever is most probable.
On the technical side, having some sort of continuation or summarization loop running seems interesting to me as a product feature. It's not enough to build a company off of, though. But it would be nice.
If you use completely unprocessed speech data (including YouTube speech with background music), I think the potential is higher, but the compute requirements become too high. If you don't have the money for GPUs, just run voice noise reduction as a preprocessing step first.
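For the preprocessing route, spectral-gating noise reduction is cheap to run on CPU. A minimal sketch with the noisereduce package (one option among many; check its current API before relying on this exact call):

```python
# pip install noisereduce scipy
import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("raw_clip.wav")
data = data.astype("float32")                  # noisereduce expects float arrays
cleaned = nr.reduce_noise(y=data, sr=rate)     # spectral gating, CPU-only
wavfile.write("clean_clip.wav", rate, cleaned)
```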
Does Hertz support multi-lingual audio right now?
I suppose someone could hack their way around the problem by finetuning it to essentially replay Piper (or whatever) output, only with more natural prosody and intonation. And then have the text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would be pretty useful to have it accept text natively!
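Conceptually, the glue would look something like this (every function name here is hypothetical; Piper is normally driven as a CLI, and Hertz-dev has no "revoice" entry point that I know of, so this is just the shape of the hack):

```python
# text LLM -> Piper TTS -> Hertz-dev finetuned to "replay" Piper audio
# with more natural prosody and intonation.
def speak(prompt: str):
    text = text_llm(prompt)                  # any text LLM
    flat_audio = piper_synthesize(text)      # robotic-but-correct TTS output
    # A Hertz-dev checkpoint finetuned on (Piper audio -> natural speech) pairs
    # would re-generate the same words with better prosody.
    return hertz_revoice(flat_audio)
```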
On the one hand, a small team focused purely on voice-to-voice could probably do better at voice-to-voice than a small team splitting its attention between voice and text. But a small team whose goal is the most useful model overall would probably get further by targeting voice+text rather than voice-only.
At the end of the day, the released product needs to be good and needs to be done in a reasonable amount of time. I highly doubt they can do a generic model as well as a more specialised one.
But if you think you know better than them, you could try to contact them, even though it looks like they are laser-focused (their public email addresses are either for investors or employee candidates).
It may not be _them_ doing it, though.
- moshi (https://github.com/kyutai-labs/moshi): speech-text foundation model using Mimi, a SOTA streaming neural audio codec
- Mini-Omni (https://github.com/gpt-omni/mini-omni): multimodal LLM based on Qwen2 offering speech input and output
- Ichigo (https://github.com/homebrewltd/ichigo): open research project extending a text-based LLM to have native listening ability, using an early fusion technique
And is the interactive generation just doing an ELIZA? i.e. "P: tell us about how AI will be interesting", "A: Yeah AI will, yeah, be interesting".
With SD and LLMs, there's a lot you can do to debug it by studying the way it responds to small changes in the prompt. But, since Hertz-dev is using sound as its input, it would be hard to discern which token you should tweak. Of course, if it's meant to be used in real time, that kind of fiddling isn't an option at all. How would you go about systematically studying Hertz-dev's behavior?
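One crude approach: perturb the audio prompt in controlled ways (pitch shift, inserted pause, a swapped word) and measure how far the continuations drift in the model's own latent space. A sketch, with hypothetical stand-ins since I haven't dug into the released inference code:

```python
import numpy as np

# `model.encode` and `model.continue_audio` are hypothetical stand-ins for
# whatever the released inference code actually exposes; continuations are
# assumed to be generated at a fixed length.
def continuation_drift(model, prompt_audio, perturbations, n_samples=8):
    def mean_latent(audio):
        # Average latent frame over several sampled continuations of `audio`.
        return np.mean(
            [model.encode(model.continue_audio(audio)).mean(axis=0)
             for _ in range(n_samples)],
            axis=0,
        )

    base = mean_latent(prompt_audio)
    # For each named perturbation, report how far the average continuation
    # moves in latent space relative to the unperturbed prompt.
    return {
        name: float(np.linalg.norm(mean_latent(perturb(prompt_audio)) - base))
        for name, perturb in perturbations.items()
    }
```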
Even the large open-source TTS models (see F5-TTS, MaskGCT) are mostly trained on very small audio datasets (say 100k hours) relative to the amount of audio available on the internet, so it's cool to see an open-source effort to scale up training significantly.
hertz-vae: a 1.8 billion parameter transformer decoder which acts as a learned prior for the audio VAE. The model uses a context of 8192 sampled latent representations (17 minutes) and predicts the next encoded audio frame as a mixture of gaussians. 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner.
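For anyone curious what "predicts the next encoded audio frame as a mixture of gaussians" can look like in code, here's a minimal sketch of a mixture-of-Gaussians prediction head (hypothetical names and dimensions, not the actual hertz-vae implementation):

```python
import torch
import torch.nn as nn

class MoGHead(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int, n_components: int = 8):
        super().__init__()
        self.n_components = n_components
        self.latent_dim = latent_dim
        # Per component: one mixture logit, a mean vector, and a log-variance vector.
        self.proj = nn.Linear(hidden_dim, n_components * (1 + 2 * latent_dim))

    def forward(self, h: torch.Tensor):
        # h: (batch, hidden_dim) transformer output at the current position
        out = self.proj(h)
        logits, means, log_vars = torch.split(
            out,
            [self.n_components,
             self.n_components * self.latent_dim,
             self.n_components * self.latent_dim],
            dim=-1,
        )
        means = means.view(-1, self.n_components, self.latent_dim)
        log_vars = log_vars.view(-1, self.n_components, self.latent_dim)
        return logits, means, log_vars

    def sample(self, h: torch.Tensor) -> torch.Tensor:
        logits, means, log_vars = self.forward(h)
        # Pick a component per batch element, then sample its diagonal Gaussian.
        comp = torch.distributions.Categorical(logits=logits).sample()
        idx = comp[:, None, None].expand(-1, 1, self.latent_dim)
        mu = means.gather(1, idx).squeeze(1)
        std = (0.5 * log_vars.gather(1, idx).squeeze(1)).exp()
        return mu + std * torch.randn_like(std)
```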
1. `codec`: First, compress 16k samplerate audio into 8 samples per second with convolutions. Then, vector quantize to 128 bits (probably 8 floats) to get a codec. This is not nearly enough bits to actually represent the audio, it's more to represent phonemes (back-of-envelope bitrate math after this list).
2. `vae`: This looks like a VAE-based diffusion model that uses the codec output as its prompt.
3. `dev`: This is a next-codec-prediction model.
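Quick arithmetic on why 128 bits per frame can't carry the raw waveform (numbers taken from the guesses above, so treat them as rough):

```python
frames_per_second = 8
bits_per_frame = 128
codec_bitrate = frames_per_second * bits_per_frame   # 1,024 bits/s ~= 1 kbit/s

raw_bitrate = 16_000 * 16                             # 16 kHz, 16-bit PCM = 256 kbit/s
print(raw_bitrate / codec_bitrate)                    # ~250x compression
```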
Put together, it probably runs like so:
1. Turn your prompt into tokens with the `codec`.
2. If you want s more seconds of audio, use `dev` to predict 8 * s more tokens.
3. Turn it back into audio with the `vae` diffusion model.
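In pseudocode, roughly (hypothetical method names, not the actual hertz-dev API; assume `encode` returns a list of latent tokens at ~8 tokens per second of 16 kHz audio):

```python
def generate_continuation(prompt_audio, seconds, codec, dev, vae):
    tokens = codec.encode(prompt_audio)           # step 1: audio -> latent tokens

    for _ in range(8 * seconds):                  # step 2: 8 tokens per extra second
        tokens.append(dev.predict_next(tokens))   # autoregressive next-codec prediction

    return vae.decode(tokens)                     # step 3: latent tokens -> waveform
```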
Is this idea (‘collapse of their generation distributions’) a researched topic? If so, under what name?
Sounds interesting, and maybe related to the whole continual learning / how-to-finetune-properly line of work.