What sounds like science fiction is now reality.
Voice-Pro is an open-source Gradio WebUI that breaks the boundaries of audio manipulation.
Powered by cutting-edge Whisper engines, this tool turns voice replication into child's play.
Key Features:
- Zero-shot Voice Cloning
- Voice Changer with 50+ Celebrity Voices
- YouTube Audio Downloading
- Vocal Isolation
- Multi-Language Text-to-Speech (Edge-TTS, F5-TTS)
- Multi-Language Translation
- Powered by Whisper Engines (Whisper, Faster-Whisper, Whisper-Timestamped)
Video Demos:
1. Voice-Pro Usage Tutorial: https://youtu.be/z8g8LMhoh_o
2. Voice Cloning Celebrity Podcast Demo: https://youtu.be/Wfo7vQCD4no
3. Full Demo Playlist: https://www.youtube.com/playlist?list=PLwx5dnMDVC9Y7dAjm9r26...
Whether you're a content creator, developer, or audio experiment enthusiast,
Voice-Pro provides a user-friendly interface to push the boundaries of audio manipulation.
On edit: this is of course an example showing the difficulty, as so much of Hepburn was her inflection.
“I’m going to kill you” could be delivered (laughing jokingly / seething with rage / ominously and creepily). I’d like a bot that can mimic the delivery in a different voice.
For some reason, the fact that we can seems to validate that we should. Any jackass now has the power of a research team of PhDs. It's kinda weird.
Governments successfully collectively controlling dangerous things so they don’t fall into the hands of rogue bad actors fundamentally opposes the extreme individualist every-man-for-himself perspective in every conceivable way. It’s the absolute opposite of “it’s everybody’s responsibility to protect themselves because everybody else is only going to look out for themselves.”
And when individuals have that much leverage, collective action is the only conceivable way to oppose it. Some of those things might be cultural, like mores, some might be laws, some might be more martial. I don’t see how extreme individualism even theoretically could be more powerful.
There is a massive problem with this on YouTube. Pretty much every category on YouTube now has a host of these bots trolling content and playing the YouTube strike system like a banjo. There are channels dedicated to showing you how to set up these content mills and how this tool can make you good money.
I think you mean "steal the labor of an actor"?
As far as I know, most countries are lagging behind when it comes to updating legislation to set binding rules around that.
This assumes existence of a license agreement or likeness/right of publicity law that prevents unauthorized use. But this is far from the case.
Companies have shown willingness to use actors’ voices to create synthetic voices without permission, compensation, or regard for their livelihoods. [1][2][3]
[1] https://animehunch.com/popular-japanese-voice-actors-band-to...
[2] https://www.theatlantic.com/technology/archive/2024/05/eleve...
[3] https://www.yahoo.com/entertainment/morgan-freeman-calls-una...
You can regulate large companies, you can regulate published software sold for profit, but it's impossible to regulate free and open source tools.
You essentially have to regulate access to computing power if you want to prevent bad actors doing bad things using these sort of tools.
Regulation means putting legal limitations on things. If it were impossible to regulate free and open source tools, it would be impossible to regulate murder and lots of other things. But it turns out it isn't impossible: sure, murder happens, but people get caught for it and punished.
Sorry, but this argument is much like the early internet triumphalism - back when people said it was impossible to regulate. Turns out lots of countries now regulate it.
I'm also not sure what's so regulated about the internet besides net neutrality in certain countries. Of course a government can put limits on the network, like banning services, but that's easy since services are rather easy to target. With content traveling over the network, it's much harder to say whether it's legit or not.
> lots of countries
What about the countries that don't regulate it, where people will keep pumping out better, leaner, and faster models? Spreading software is trivial; all you achieve is that the public won't be aware of what's possible.
The more I think about it, if anything should be regulated, it's a requirement to provide a third-party (probably government-backed) ID verification system, so it would be possible for my mom to know it's me calling her. Basically, kill caller ID spoofing.
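To sketch what I mean (loosely in the spirit of STIR/SHAKEN; the names, numbers, and assertion format here are made up purely for illustration): a trusted authority signs a short-lived claim binding a number to a verified identity, and the callee's phone only shows "verified" if the signature checks out.
```python
# Toy sketch of caller verification: an authority signs a claim tying a phone
# number to a verified identity; the receiving phone checks the signature.
# Everything here (claim format, numbers, names) is illustrative, not a real scheme.
import json
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The verification authority's key pair (the public half would ship with phones).
authority_key = Ed25519PrivateKey.generate()
authority_pub = authority_key.public_key()

def issue_assertion(caller_number: str, verified_name: str) -> tuple[bytes, bytes]:
    """Authority side: sign a short-lived claim that this number belongs to this person."""
    claim = json.dumps({
        "number": caller_number,
        "name": verified_name,
        "expires": int(time.time()) + 60,  # valid only for one call setup
    }).encode()
    return claim, authority_key.sign(claim)

def check_incoming_call(claim: bytes, signature: bytes) -> str:
    """Callee side: show a verified name only if the authority's signature holds."""
    try:
        authority_pub.verify(signature, claim)
    except InvalidSignature:
        return "Unverified caller"
    data = json.loads(claim)
    if data["expires"] < time.time():
        return "Unverified caller (stale assertion)"
    return f"Verified: {data['name']} ({data['number']})"

claim, sig = issue_assertion("+44 7700 900123", "Your Son")
print(check_incoming_call(claim, sig))
```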
Generally, things get regulated on the internet that people swore would never be regulated precisely because they were on the internet. Example: sales taxes. Perhaps you are old enough to remember when sales tax collection would supposedly never be enforceable on internet transactions - those idiot lawyers don't understand, it's on the internet, the sale didn't happen in that country or in that state, sales taxes will never happen on the internet, hah hah. It's unenforceable, it is logically undoable, there are so many edge cases - ugh, the law just does not understand technology!
oops, sales taxes now on internet purchases.
GDPR is another example of something now regulated on the internet that, for years before it happened, most of HN was completely convinced would be impossible!!
If this thing becomes too big a problem for societies, regulation will happen, with varying levels of effectiveness I'm sure.
And then in twenty years' time we will be saying what, you can't regulate genital-eating viral synths because a guy can make those in his garage and spread them via nasal spray, this technology is unstoppable and unregulatable, not like some open source deepfake library!!
Obligatory/relevant xkcd: https://xkcd.com/538/
The closest thing I can think of is maybe the regulation of DRM ripping tools, but they're still out there in the wild and determined actors can easily get ahold of them. So I'm not at all confident that regulation will have any measurable meaningful effect.
The "determined actor" can get bombs, tanks, fissure material. There noone says "WHELP they can get it anyway so why bother regulating it LMAO" - somehow this is different in anything not physical?
That something is not currently regulated does not mean it can never be regulated. Further, it does not seem likely that they would regulate open source tooling as such, but rather certain uses, and if the open source tooling allowed those uses then what would happen is -
GitHub and other big sources of code would refuse to host it as containing legally disallowed things. So, for example, if it were regulated in the U.S., then GitHub stops allowing it, and everyone moves to some European git provider.
At the same time, bigger companies will stop using the library because of liability.
Europe then regulates, and it can't be in European git repos... At some point many devs abandon the particular library because it's not worth it (I get it, this one is actually for the love of doing the illegal thing, so they won't abandon it, but despite the power of love, most things in this world do not actually run on it).
Can determined actors get ahold of them and do the things the law forbids them to do? Sure! That's called crime. Then law enforcement catches determined actors and puts them in prison. That's called the real world!
Will criminals stop? Nope, because there is benefit to what they're doing. Maybe some will stop because they'll think screw it, I can make more money working for the man. And some will be caught sooner or later. And maybe in version two of the regulations there will be AI enhancements - this crime was committed with AI, allowing us to take all your belongings, add 10 years to your sentence, and deprive you of the right to ever own a computing device again... etc. etc. And some people will stop, and others will get more violent and aggressive about their criminal business.
I don't know exactly what a measurable, meaningful effect means; for some people it will be measurable and meaningful, for some not, and for some parts of society the regulation would in many ways be worse than what it is fighting against. I'm not saying regulation will solve problems 100%, I'm just saying this whole "they can't regulate us because TECH!!!" thing that developers seem to regularly go through with anything they set their eye on is a pipe dream.
BS. Can you imagine such legislation? Yes, thus it can be done.
As an early example, the CRA (Cyber Resilience Act) already contains provisions about open source stewards and security. So far they are legal persons, aka foundations, but could easily relate to any contributor or maintainer.
Seriously, what can anybody do about random hacker Joe publishing under the name XoX? Even if they burn GitHub and friends to the ground, if something is useful it will be really really hard to get rid of it. Remember youtube-dl? It's now https://github.com/yt-dlp/yt-dlp
If they make anything that cripples open source development, they will feel it quite soon when they realize that it also cripples their world, since much of their tooling and infrastructure also depends on it.
Killing open source is like killing the internet itself.
Your example with yt-dl doesn't matter.
Open source/free software inherently relies on copyright and all state legal infrastructure. Once you operate outside, it's no longer open source/free software.
Can you host software in a way that's really hard to block? Sure. There is onion routing and plethora of other options.
But that's no longer open source/free software. You are in a realm of dark web and marketplaces.
I do maintain a semi-popular open source project that I took over after about a year of inactivity, and I seriously considered quitting because of the CRA. It's quite easy to cripple/kill something when it basically runs on people volunteering their free time.
On a forum that frequently discusses technology with enthusiasm you'd think there'd be more enthusiasm and more constructive criticism instead of blanket write-offs.
As for positive applications, some I see:
* Allowing those with speech impairments to communicate using their natural voice again
* Allowing those uncomfortable with their natural voice, such as transgender people, to communicate closer to how they wish to be perceived
* Translation of a user's voice, maintaining emotion and intonation, for natural cross-language communication on calls
* Professional-quality audio from cheap microphone setups (for video tutorials, indie games, etc.)
* Doing character voices for a D&D session, audiobook, etc.
* Customization of voice assistants, such as to use a native accent/dialect
* Movies, podcasts, audiobooks, news broadcasts, etc. made available in a huge range of languages
* If integrated with something like airpods, babelfish-like automatic isolation and translation of any speech around you
* Privacy from being able to communicate online or record videos without revealing your real voice, which I think is why many (myself included) currently resort to text-only
* New forms of interactive media - customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.
* And of course: memes, satire, and parody
I appreciate HN's general view on technologies like encrypted messaging - not falling into "we need to ban this now because pedophiles could use it" hysteria. But for anything involving machine learning, I'm concerned how often the hacker mentality seems to go out the window and we instead get people advocating for it to be made illegal to host the code, for instance.
Also, notice how the legitimate use-cases 1, 3 and 4 imply the user consenting to clone their own voice, which is fine. However, the only use-case which would require cloning a specific human voice belonging to a third party, use-case 11, is "memes, satire, and parody"... and not much imagination is needed to see how steep and buttery that Teflon slippery slope is.
2, 5, 6, 9: It's true that in theory all you need is some way to capture the characteristics of a desired voice, but voice-cloning methods are the way to do this currently. If you want a voice assistant with a native accent, you fine-tune on the voice of a native speaker - opposed to turning a bunch of dials manually.
7, 8, 10: Here I think there is benefit specifically from sounding like a particular person. The dynamically generated lines of movie characters/videogame NPCs should be consistent with the actor's pre-recorded lines, for instance, and hearing someone in their own voice is more natural for communication and makes conversation easier to follow.
Pedantically, what's promoted here is a tool which features voice cloning prominently but not exclusively - other workflows demonstrated (like generating subtitles) seem mostly unobjectionable.
> Also, notice how the legitimate use-cases 1, 3 and 4 imply the user consenting to clone their own voice, which is fine
I think all, outside of potentially 8 and 11, could be done with full consent of the voice being cloned - an agreement with the movie actor to use their voice for dubbing to other languages, for example. That's already a significant number of use-cases for this tool.
> use-case 11, is "memes, satire, and parody"... and not much imagination is needed to see how steep and buttery that Teflon slippery slope is.
IMO prohibition around satire/parody would be the slippery slope, particularly with the potential for selective enforcement.
macOS devs blindly trust it like it's the App Store.
But I agree that everyone should review the Homebrew install script for any package they're installing if they're concerned about security.
If not, any recommendations for alternative projects?
P.S. Are there any tools for synthetic voice creation? Maybe melding two or more voices together, or just exploring latent space? Would be fun for character creation to create completely new voices.
Game studios will spin up a bunch of unique virtual voices for all the dialogue of extras. It'll probably be longer before we see replacements of main characters, though. There's been some research in speech-to-speech transference as well - meaning company employee A records character B's line with the appropriate emotional nuance (angry, sad, etc.) and the emotional aspect is copied on top of the generated TTS.
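For what it's worth, the open-source Coqui TTS package already ships a voice-conversion model that does roughly this kind of speech-to-speech transfer. A minimal sketch, assuming the bundled FreeVC model (the file names are placeholders, and this is certainly not what studios actually use):
```python
# Rough sketch: carry the acted performance (timing, emotion) from the source
# recording over to the target character's voice using Coqui TTS's FreeVC
# voice-conversion model. File paths are placeholders.
from TTS.api import TTS

vc = TTS("voice_conversion_models/multilingual/vctk/freevc24")
vc.voice_conversion_to_file(
    source_wav="employee_reading_line_angrily.wav",  # the acted take
    target_wav="character_reference_voice.wav",      # the voice to transfer onto
    file_path="character_line_angry.wav",
)
```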
I'm imagining it. It sucks to imagine.
I'm imagining it being used to scam people. I'm imagining it being used to leech off of performers who have worked very hard to build a recognizable voice (and it is a lot of work to speak like a performer). I'm imagining how this will be used in revenge porn. I'm imagining how this will be used to circumvent voice-controlled access systems.
This is bad. You should feel bad.
And I know you are thinking, "Wait, but I worked really hard on this!" Sorry, I appreciate that it might be technically impressive, but you've basically come out with "we've invented a device that mixes bleach and ammonia automatically in your bedroom! It's so efficient at mixing those two, we can fill a space with chlorine gas in under 10 seconds! Imagine a world where every bedroom could become a toxic site with only the push of a button."
That this is posted here, proudly, is quite frankly astoundingly embarrassing for you.
For spear-phishing (impersonate CEO, tell assistant to transfer money) it's more feasible, but I hope it forces acceptance that "somebody sounds like X over the phone" is not and has never been a good verification method - people have been falling for scams like those fake ransom calls[0] for decades.
Not that there aren't potential harms, but I think they're outweighed by positive applications. Those uncomfortable with their natural voice, such as transgender people, can communicate closer to how they wish to be perceived - or someone whose voice has been impaired (whether just a temporary cold or a permanent disorder/illness/accident) can use it from previous recordings. Privacy benefits from being able to communicate online or record videos without revealing your real voice, which I think is why many (myself included) currently resort to text-only. There's huge potential in the translation and vocal isolation aspects aiding communication - feels to me as though we're heading towards creating our own babelfish. There's also a bunch of creative applications - doing character voices for a D&D session or audiobook, memes/satire, and likely new forms of interactive media (customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.)
Scammers don't have to sound like a specific person to be helped by software like this.
I think there's also an autonomy argument to be made, if the alternative is to the effect of ensuring that people cannot use tools to hide their accent (and particularly if, as above, the intent is so they can be discriminated against based on it). Even though it isn't something we've really been able to do before, I think it's generally a person's own right to modify their voice.
https://www.voanews.com/a/illness-took-away-her-voice-ai-cre...
That being said, it does seem a bit bizarre that the repo's home page is proudly trumpeting the ability to co-opt other people's identities without their permission (and yes your unique vocal pattern is definitely part of your identity - I mean it's used in some forms of biometric data). They're doing the project a bit of a disservice.
EXACTLY. Clone the wrong person's voice and it's game over.
I don't know if this helps or harms the credibility but I can't really talk more than an hour without seriously straining my voice. So cloning it sounds like a great use-case for someone with a similar problem.
Looking forward to trying this.
> so it doesn't sound like I have offloaded the work to someone else.
So, deception. Deception that you feel is justified, but deception nonetheless.
The information I'm conveying is truthful and these are my words. The voice, generated or not, is not the thing I'm trying to convince people to believe.
Edward James Olmos if you're reading this, I'm willing to pay a license fee, but then I expect actual recordings and not just AI bullshit. I'm not pirating your voice, you're refusing to let me hire it.
Easier (and cheaper?) to just use ElevenLabs.
But if I can get the performance I want and shift it to another voice, then fully voicing free works becomes very accessible (even better would be generative AI which could take a sample of what you want and re-render it into something which sounds like a more professional performance - voice in-fill I suppose).
Is there anything new in this?
I simply asked "is there anything new in this?" because, you know, I was interested to know if there was anything new in this.
Is it not the same with this project?
Please no spoilers!
https://www.ocregister.com/2005/12/12/governors-full-stateme...
Reading over the governor's statement explaining his reasons for denial of clemency, my brain couldn't help but do so in an Arnold voice. Sometimes, to amuse friends, I would read portions of it aloud while doing the voice.
Maybe it's a bit tasteless, like the anime-girl Demon Core memes, but there's just something about hearing the legal and administrative justification for proceeding with an execution in the voice of the Terminator.
I'm the same way with famous YouTubers. If I see "Guru Larry" Bundy Jr. or Clint "LGR" Basinger leave a comment on someone else's video, my brain reads it in their voice.
I'm all for innovation, but I don't really see the use case of cloning random voices to make podcasts? Listening to Zuck interview Elon? ok...?
For example, my family's passphrase is- just kidding.
I use Coqui TTS[0] as part of my home automation. I wrote a small Python script that lets me upload a voice clip for it to clone (I got the idea from HeyWillow[1]), and a small shim that sends the output to a Home Assistant media player instead of the standard output device (a rough sketch is below, after the links). I run the TTS container on a VM with a Tesla P4 (~£100 to buy) and get about 1x-2x (roughly the same time it'd take to say it, to process) using the large model.
Just for a giggle, I uploaded a few 3-5 second clips of myself speaking and cloned my voice, then executed a command to our living room media player to call my wife into the room; from another room, she was 100% convinced it was me speaking words I'd never spoken.
I tried playing with a variety of sentences for a few hours and overall, it sounded almost exactly like me, to me, with the exception of some "attitude" and "intonation" I know I wouldn't use in my speech. I didn't notice much of an improvement using much longer clips; the short ones were "good enough".
Tangentially, it really bugs me that most phone providers in the UK now insist you record a "personal greeting" before they'll let you check your voicemail box. I just record silence, because the last thing I want/need is a voicemail greeting in my own voice confirming to some randomer I didn't want calling me who I am and that my number is active - even more so knowing that I can clone any voice to reasonably good accuracy with just a few seconds of audio.
[0] https://github.com/coqui-ai/TTS [1] https://heywillow.io/
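Roughly, the TTS call plus the Home Assistant shim boil down to something like this. A minimal sketch, not my actual setup: the HA URL, token, entity id, and file paths are placeholders, and it assumes the synthesized file is served from HA's www folder under /local/.
```python
# Minimal sketch: clone a reference voice with Coqui XTTS, then ask Home
# Assistant to play the result on a media_player entity via its REST API.
# URL, token, entity id, and paths are illustrative placeholders.
import requests
from TTS.api import TTS

HA_URL = "http://homeassistant.local:8123"     # assumed Home Assistant address
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # created in the HA user profile
SPEAKER_WAV = "/config/voices/me.wav"          # the uploaded reference clip
OUTPUT_WAV = "/config/www/tts/announce.wav"    # served by HA as /local/tts/announce.wav

def announce(text: str, entity_id: str = "media_player.living_room") -> None:
    # Clone the reference voice and write the synthesized speech to a file.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
    tts.tts_to_file(text=text, speaker_wav=SPEAKER_WAV, language="en",
                    file_path=OUTPUT_WAV)

    # Ask Home Assistant to play the file on the chosen media player
    # using the media_player.play_media service over the REST API.
    requests.post(
        f"{HA_URL}/api/services/media_player/play_media",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={
            "entity_id": entity_id,
            "media_content_id": f"{HA_URL}/local/tts/announce.wav",
            "media_content_type": "music",
        },
        timeout=30,
    )

if __name__ == "__main__":
    announce("Could you come to the living room for a minute?")
```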
Well, that's a big old fail. Just a reminder: The given (and proper) home of open source is on an open source OS.
Thanks for raising this aspect.
Btw https://github.com/haimgel/display-switch helps a lot.
https://github.com/abus-aikorea/voice-pro?tab=readme-ov-file...
clear as day, do not trust this code
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav="/path/to/target/speaker.wav",
    language="en",
)
```
Comprehensive Gradio WebUI for audio processing, powered by Whisper engines (Whisper, Faster-Whisper, Whisper-Timestamped). Features Voice Changer, zero-shot Voice Cloning (E2, F5-TTS), YouTube downloading, vocal isolation (UVR5), Text-to-Speech (Edge-TTS), and multi-language translation. Perfect for content creators and developers.
The primary goal of the voice actor is to achieve a personal connection, and I don't see how AI is a real threat to that end. I feel the same about other mediums as well. This will likely be used for scams, but I doubt it will ever draw as many eyes, or ears, as something a real human can produce. Thus, it won't be a valuable tool to marketers and will be largely unprofitable.
I wonder if certain familiar voices like that of your parents would lead to higher understanding and retention.
https://github.com/abus-aikorea/voice-pro?tab=readme-ov-file...
hard pass and anyone who reads this and continues is bonkers
https://github.com/oobabooga/text-generation-webui https://github.com/AUTOMATIC1111/stable-diffusion-webui
If you have concerns or doubts about telemetry or spyware, there are countless software options available for detection. Give it a try.
So yes, the app can certainly harm the OS, and the venv would not provide any protection against this.
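To make that concrete: a venv only isolates Python package resolution, not what the code can touch on the system. Something like this runs fine from inside any venv (the file name is just an example):
```python
# A venv changes where Python looks for packages; it is not a sandbox.
# Code inside it still runs with the full permissions of the invoking user.
import pathlib
import sys

print(sys.prefix)  # points inside the venv...

# ...but nothing stops reads/writes anywhere the user can reach:
target = pathlib.Path.home() / "written_from_inside_a_venv.txt"
target.write_text("the venv provided no protection against this")
print(f"wrote {target}")
```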