I might be a bit biased (I did my PhD exploring how VUIs can persuade humans), but I think VUI is "the future" of computer interaction. If it's not the future, then at least it adds a new group of people (kids + elderly people) as potential users.
Additionally, I’ve been exploring an idea about voice interaction systems. Currently, most voice interactions are processed by converting voice input into text, generating a text-based response, and then turning this text back into audio. But what if we could train the system to respond directly in voice, without involving text at all? If developed to maturity, this model could produce responses that feel more natural and spontaneous, possibly diverging from traditional text-to-speech outputs. Natural speech has unique syntax and rhythm, not to mention dialect and tone variations, which could make a purely voice-trained system fascinating and more human-like.
Could you let me know if your current voice interaction model follows the standard speech-to-text-to-speech process, or if there is exploration in voice-to-voice processing?
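To make the contrast concrete, the two architectures reduce to something like this (all function names are made up, just to illustrate the distinction):

```python
# Conventional cascade: every turn round-trips through text.
def cascade_turn(user_audio):
    text_in = speech_to_text(user_audio)    # ASR
    text_out = text_llm(text_in)            # text-only reasoning
    return text_to_speech(text_out)         # TTS; prosody is reconstructed, not heard

# Direct voice-to-voice: one model maps audio to audio, so rhythm,
# dialect, and tone in the input can shape the response without
# ever being flattened into text.
def voice_to_voice_turn(user_audio):
    return audio_model(user_audio)
```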
It's interesting to think about what complete diversity (i.e., no tendency toward homogeneous conversation partners whatsoever in the training data) would yield, given that the model is just trying to produce whatever is most probable.
On the technical side, having some sort of continuation or summarization loop running seems interesting to me as a product feature. It's not enough to build a company off of, though. But it would be nice.
If you use completely unprocessed speech data (including YouTube speech with background music), I think the potential is higher, but the compute requirements become too high. If you don't have the money for GPUs, just run voice noise reduction as a preprocessing step first.
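For the preprocessing route, spectral-gating noise reduction is cheap to run on CPU. A minimal sketch with the noisereduce package (one option among many; check its current API before relying on this exact call):

```python
# pip install noisereduce scipy
import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("raw_clip.wav")
data = data.astype("float32")                  # noisereduce expects float arrays
cleaned = nr.reduce_noise(y=data, sr=rate)     # spectral gating, CPU-only
wavfile.write("clean_clip.wav", rate, cleaned)
```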
Does Hertz support multi-lingual audio right now?
I suppose someone could hack their way around the problem by finetuning it to essentially replay Piper (or whatever) output, only with more natural prosody and intonation. And then have the text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would be pretty useful to have it accept text natively!
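Conceptually, the glue would look something like this (every function name here is hypothetical; Piper is normally driven as a CLI, and Hertz-dev has no "revoice" entry point that I know of, so this is just the shape of the hack):

```python
# text LLM -> Piper TTS -> Hertz-dev finetuned to "replay" Piper audio
# with more natural prosody and intonation.
def speak(prompt: str):
    text = text_llm(prompt)                  # any text LLM
    flat_audio = piper_synthesize(text)      # robotic-but-correct TTS output
    # A Hertz-dev checkpoint finetuned on (Piper audio -> natural speech) pairs
    # would re-generate the same words with better prosody.
    return hertz_revoice(flat_audio)
```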
On the one hand, a small team focused purely on voice-to-voice could probably do better at voice-to-voice than a small team splitting its attention between voice and text. But a small team whose goal is the most useful model overall would probably get further by targeting voice+text rather than voice-only.
At the end of the day, the released product needs to be good and needs to be done in a reasonable amount of time. I highly doubt they can do a generic model as well as a more specialised one.
But if you think you know better than them, you could try to contact them, even though it looks like they are laser-focused (their public email addresses are either for investors or employee candidates).
It may not be _them_ doing it, though.
- moshi (https://github.com/kyutai-labs/moshi): speech-text foundation model using Mimi, a SOTA streaming neural audio codec
- Mini-Omni (https://github.com/gpt-omni/mini-omni): multimodal LLM based on Qwen2 offering speech input and output
- Ichigo (https://github.com/homebrewltd/ichigo): open research project extending a text-based LLM to have native listening ability, using an early fusion technique
And is the interactive generation just doing an ELIZA? i.e. "P: tell us about how AI will be interesting", "A: Yeah AI will, yeah, be interesting".
With SD and LLMs, there's a lot you can do to debug it by studying the way it responds to small changes in the prompt. But, since Hertz-dev is using sound as its input, it would be hard to discern which token you should tweak. Of course, if it's meant to be used in real time, that kind of fiddling isn't an option at all. How would you go about systematically studying Hertz-dev's behavior?
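One crude approach: perturb the audio prompt in controlled ways (pitch shift, inserted pause, a swapped word) and measure how far the continuations drift in the model's own latent space. A sketch, with hypothetical stand-ins since I haven't dug into the released inference code:

```python
import numpy as np

# `model.encode` and `model.continue_audio` are hypothetical stand-ins for
# whatever the released inference code actually exposes; continuations are
# assumed to be generated at a fixed length.
def continuation_drift(model, prompt_audio, perturbations, n_samples=8):
    def mean_latent(audio):
        # Average latent frame over several sampled continuations of `audio`.
        return np.mean(
            [model.encode(model.continue_audio(audio)).mean(axis=0)
             for _ in range(n_samples)],
            axis=0,
        )

    base = mean_latent(prompt_audio)
    # For each named perturbation, report how far the average continuation
    # moves in latent space relative to the unperturbed prompt.
    return {
        name: float(np.linalg.norm(mean_latent(perturb(prompt_audio)) - base))
        for name, perturb in perturbations.items()
    }
```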
Even the large open-source TTS models (see F5-TTS, MaskGCT) are mostly trained on very small audio datasets (say 100k hours) relative to the amount of audio available on the internet, so it's cool to see an open-source effort to scale up training significantly.
hertz-vae: a 1.8 billion parameter transformer decoder which acts as a learned prior for the audio VAE. The model uses a context of 8192 sampled latent representations (17 minutes) and predicts the next encoded audio frame as a mixture of gaussians. 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner.
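For anyone curious what "predicts the next encoded audio frame as a mixture of gaussians" can look like in code, here's a minimal sketch of a mixture-of-Gaussians prediction head (hypothetical names and dimensions, not the actual hertz-vae implementation):

```python
import torch
import torch.nn as nn

class MoGHead(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int, n_components: int = 8):
        super().__init__()
        self.n_components = n_components
        self.latent_dim = latent_dim
        # Per component: one mixture logit, a mean vector, and a log-variance vector.
        self.proj = nn.Linear(hidden_dim, n_components * (1 + 2 * latent_dim))

    def forward(self, h: torch.Tensor):
        # h: (batch, hidden_dim) transformer output at the current position
        out = self.proj(h)
        logits, means, log_vars = torch.split(
            out,
            [self.n_components,
             self.n_components * self.latent_dim,
             self.n_components * self.latent_dim],
            dim=-1,
        )
        means = means.view(-1, self.n_components, self.latent_dim)
        log_vars = log_vars.view(-1, self.n_components, self.latent_dim)
        return logits, means, log_vars

    def sample(self, h: torch.Tensor) -> torch.Tensor:
        logits, means, log_vars = self.forward(h)
        # Pick a component per batch element, then sample its diagonal Gaussian.
        comp = torch.distributions.Categorical(logits=logits).sample()
        idx = comp[:, None, None].expand(-1, 1, self.latent_dim)
        mu = means.gather(1, idx).squeeze(1)
        std = (0.5 * log_vars.gather(1, idx).squeeze(1)).exp()
        return mu + std * torch.randn_like(std)
```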
1. `codec`: First, compress 16k samplerate audio into 8 samples per second with convolutions. Then, vector quantize to 128 bits (probably 8 floats) to get a codec. This is not nearly enough bits to actually represent the audio, it's more to represent phonemes (back-of-envelope bitrate math after this list).
2. `vae`: This looks like a VAE-based diffusion model that uses the codec output as its prompt.
3. `dev`: This is a next-codec-prediction model.
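Quick arithmetic on why 128 bits per frame can't carry the raw waveform (numbers taken from the guesses above, so treat them as rough):

```python
frames_per_second = 8
bits_per_frame = 128
codec_bitrate = frames_per_second * bits_per_frame   # 1,024 bits/s ~= 1 kbit/s

raw_bitrate = 16_000 * 16                             # 16 kHz, 16-bit PCM = 256 kbit/s
print(raw_bitrate / codec_bitrate)                    # ~250x compression
```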
Put together, it probably runs like so:
1. Turn your prompt into tokens with the `codec`.
2. If you want s more seconds of audio, use `dev` to predict 8 * s more tokens.
3. Turn it back into audio with the `vae` diffusion model.
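In pseudocode, roughly (hypothetical method names, not the actual hertz-dev API; assume `encode` returns a list of latent tokens at ~8 tokens per second of 16 kHz audio):

```python
def generate_continuation(prompt_audio, seconds, codec, dev, vae):
    tokens = codec.encode(prompt_audio)           # step 1: audio -> latent tokens

    for _ in range(8 * seconds):                  # step 2: 8 tokens per extra second
        tokens.append(dev.predict_next(tokens))   # autoregressive next-codec prediction

    return vae.decode(tokens)                     # step 3: latent tokens -> waveform
```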
Is this idea (‘collapse of their generation distributions’) a researched topic? If so, under what name?
Sounds interesting, and maybe related to the whole continual learning / how-to-finetune-properly line of work.