What sounds like science fiction is now reality.
Voice-Pro is an open-source Gradio WebUI that breaks the boundaries of audio manipulation.
Powered by cutting-edge Whisper engines, this tool turns voice replication into child's play.
Key Features:
- Zero-shot Voice Cloning
- Voice Changer with 50+ Celebrity Voices
- YouTube Audio Downloading
- Vocal Isolation
- Multi-Language Text-to-Speech (Edge-TTS, F5-TTS)
- Multi-Language Translation
- Powered by Whisper Engines (Whisper, Faster-Whisper, Whisper-Timestamped)
Video Demos:
1. Voice-Pro Usage Tutorial: https://youtu.be/z8g8LMhoh_o
2. Voice Cloning Celebrity Podcast Demo: https://youtu.be/Wfo7vQCD4no
3. Full Demo Playlist: https://www.youtube.com/playlist?list=PLwx5dnMDVC9Y7dAjm9r26...
Whether you're a content creator, developer, or audio experiment enthusiast,
Voice-Pro provides a user-friendly interface to push the boundaries of audio manipulation.
On edit: this is of course an example showing the difficulty, as so much of Hepburn was her inflection.
“I’m going to kill you” could be delivered (laughing jokingly / seething with rage / ominously and creepily). I’d like a bot that can mimic the delivery in a different voice.
For some reason, the fact that we can seems to validate that we should. Any jackass now has the power of a research team of PhDs. It's kinda weird.
Governments successfully collectively controlling dangerous things so they don’t fall into the hands of rogue bad actors fundamentally opposes the extreme individualist every-man-for-himself perspective in every conceivable way. It’s the absolute opposite of “it’s everybody’s responsibility to protect themselves because everybody else is only going to look out for themselves.”
And when individuals have that much leverage, collective action is the only conceivable way to oppose it. Some of those things might be cultural, like mores, some might be laws, some might be more martial. I don’t see how extreme individualism even theoretically could be more powerful.
There is a massive problem with this on YouTube. Pretty much every category on YouTube now has a host of these bots trolling content and playing the YouTube strike system like a banjo. There are channels dedicated to showing you how to set up these content mills and how this tool can make you good money.
I think you mean "steal the labor of an actor"?
As far as I know, most countries are lagging behind when it comes to updating legislation to set binding rules around that.
This assumes existence of a license agreement or likeness/right of publicity law that prevents unauthorized use. But this is far from the case.
Companies have shown willingness to use actors’ voices to create synthetic voices without permission, compensation, or regard for their livelihoods. [1][2][3]
[1] https://animehunch.com/popular-japanese-voice-actors-band-to...
[2] https://www.theatlantic.com/technology/archive/2024/05/eleve...
[3] https://www.yahoo.com/entertainment/morgan-freeman-calls-una...
You can regulate large companies, you can regulate published software sold for profit, but it's impossible to regulate free and open source tools.
You essentially have to regulate access to computing power if you want to prevent bad actors doing bad things using these sort of tools.
Regulation means putting legal limitations on things. If it were impossible to regulate free and open source tools, it would be impossible to regulate murder and lots of other things. But it turns out it isn't impossible: sure, murder happens, but people get caught for it and punished.
Sorry, but this argument is much like the early internet triumphalism - back when people said it was impossible to regulate. Turns out lots of countries now regulate it.
I'm also not sure what's so regulated about the internet besides net neutrality in certain countries. Of course a government can put limits on the network, like banning services, but that's easy since services are rather easy to target. With content traveling over the network, it's much harder to say whether it's legit or not.
> lots of countries
What about the countries that don't regulate it, where people will keep pumping out better, leaner, and faster models? Spreading software is trivial; all you achieve is that the public won't be aware of what's possible.
The more I think about it, if anything should be regulated, it's a requirement to provide a third-party (probably government-backed) ID verification system, so it would be possible for my mom to know it's me calling her. Basically, kill caller ID spoofing.
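To sketch what I mean (loosely in the spirit of STIR/SHAKEN; the names, numbers, and assertion format here are made up purely for illustration): a trusted authority signs a short-lived claim binding a number to a verified identity, and the callee's phone only shows "verified" if the signature checks out.
```python
# Toy sketch of caller verification: an authority signs a claim tying a phone
# number to a verified identity; the receiving phone checks the signature.
# Everything here (claim format, numbers, names) is illustrative, not a real scheme.
import json
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The verification authority's key pair (the public half would ship with phones).
authority_key = Ed25519PrivateKey.generate()
authority_pub = authority_key.public_key()

def issue_assertion(caller_number: str, verified_name: str) -> tuple[bytes, bytes]:
    """Authority side: sign a short-lived claim that this number belongs to this person."""
    claim = json.dumps({
        "number": caller_number,
        "name": verified_name,
        "expires": int(time.time()) + 60,  # valid only for one call setup
    }).encode()
    return claim, authority_key.sign(claim)

def check_incoming_call(claim: bytes, signature: bytes) -> str:
    """Callee side: show a verified name only if the authority's signature holds."""
    try:
        authority_pub.verify(signature, claim)
    except InvalidSignature:
        return "Unverified caller"
    data = json.loads(claim)
    if data["expires"] < time.time():
        return "Unverified caller (stale assertion)"
    return f"Verified: {data['name']} ({data['number']})"

claim, sig = issue_assertion("+44 7700 900123", "Your Son")
print(check_incoming_call(claim, sig))
```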
Generally, things get regulated on the internet that people swore would never be regulated precisely because they were on the internet. Example: sales taxes. Perhaps you are old enough to remember when sales tax collection would supposedly never be enforceable on internet transactions - those idiot lawyers don't understand, it's on the internet, the sale didn't happen in that country or in that state, sales taxes will never happen on the internet, hah hah. It's unenforceable, it is logically undoable, there are so many edge cases - ugh, the law just does not understand technology!
oops, sales taxes now on internet purchases.
GDPR is another example of something now regulated on the internet that, for years before it happened, most of HN was completely convinced would be impossible!!
If this thing becomes too big a problem for societies, regulation will happen, with varying levels of effectiveness I'm sure.
And then in twenty years' time we will be saying what, you can't regulate genital-eating viral synths because a guy can make those in his garage and spread them via nasal spray, this technology is unstoppable and unregulatable, not like some open source deepfake library!!
Obligatory/relevant xkcd: https://xkcd.com/538/
The closest thing I can think of is maybe the regulation of DRM ripping tools, but they're still out there in the wild and determined actors can easily get ahold of them. So I'm not at all confident that regulation will have any measurable meaningful effect.
The "determined actor" can get bombs, tanks, fissure material. There noone says "WHELP they can get it anyway so why bother regulating it LMAO" - somehow this is different in anything not physical?
That something is not currently regulated does not mean it can never be regulated. Further, it does not seem likely that they would regulate open source tooling as such, but rather certain uses, and if the open source tooling allowed those uses then what would happen is -
GitHub and other big sources of code would refuse to host it as containing legally disallowed things. So, for example, if it were regulated in the U.S., then GitHub stops allowing it, and everyone moves to some European git provider.
At the same time, bigger companies will stop using the library because of liability.
Europe then regulates, and it can't be in European git repos... At some point many devs abandon the particular library because it's not worth it (I get it, this one is actually for the love of doing the illegal thing, so they won't abandon it, but despite the power of love, most things in this world do not actually run on it).
Can determined actors get ahold of them and do the things the law forbids them to do? Sure! That's called crime. Then law enforcement catches determined actors and puts them in prison. That's called the real world!
Will criminals stop? Nope, because there is benefit to what they're doing. Maybe some will stop because they'll think screw it, I can make more money working for the man. And some will be caught sooner or later. And maybe in version two of the regulations there will be AI enhancements - this crime was committed with AI, allowing us to take all your belongings, add 10 years to your sentence, and deprive you of the right to ever own a computing device again... etc. etc. And some people will stop, and others will get more violent and aggressive about their criminal business.
I don't know exactly what a measurable, meaningful effect means; for some people it will be measurable and meaningful, for some not, and for some parts of society the regulation would in many ways be worse than what it is fighting against. I'm not saying regulation will solve problems 100%, I'm just saying this whole "they can't regulate us because TECH!!!" thing that developers seem to regularly go through with anything they set their eye on is a pipe dream.
BS. Can you imagine such legislation? Yes, thus it can be done.
As an early example, the CRA (Cyber Resilience Act) already contains provisions about open source stewards and security. So far they are legal persons, aka foundations, but could easily relate to any contributor or maintainer.
Seriously, what can anybody do about random hacker Joe publishing under the name XoX? Even if they burn GitHub and friends to the ground, if something is useful it will be really really hard to get rid of it. Remember youtube-dl? It's now https://github.com/yt-dlp/yt-dlp
If they make anything that cripples open source development, they will feel it quite soon when they realize that it also cripples their world, since much of their tooling and infrastructure also depends on it.
Killing open source is like killing the internet itself.
Your example with yt-dl doesn't matter.
Open source/free software inherently relies on copyright and all state legal infrastructure. Once you operate outside, it's no longer open source/free software.
Can you host software in a way that's really hard to block? Sure. There is onion routing and plethora of other options.
But that's no longer open source/free software. You are in a realm of dark web and marketplaces.
I do maintain a semi-popular open source project that I took over after about a year of inactivity, and I seriously considered quitting because of the CRA. It's quite easy to cripple/kill something when it basically runs on people volunteering their free time.
On a forum that frequently discusses technology with enthusiasm you'd think there'd be more enthusiasm and more constructive criticism instead of blanket write-offs.
As for positive applications, some I see:
* Allowing those with speech impairments to communicate using their natural voice again
* Allowing those uncomfortable with their natural voice, such as transgender people, to communicate closer to how they wish to be perceived
* Translation of a user's voice, maintaining emotion and intonation, for natural cross-language communication on calls
* Professional-quality audio from cheap microphone setups (for video tutorials, indie games, etc.)
* Doing character voices for a D&D session, audiobook, etc.
* Customization of voice assistants, such as to use a native accent/dialect
* Movies, podcasts, audiobooks, news broadcasts, etc. made available in a huge range of languages
* If integrated with something like airpods, babelfish-like automatic isolation and translation of any speech around you
* Privacy from being able to communicate online or record videos without revealing your real voice, which I think is why many (myself included) currently resort to text-only
* New forms of interactive media - customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.
* And of course: memes, satire, and parody
I appreciate HN's general view on technologies like encrypted messaging - not falling into "we need to ban this now because pedophiles could use it" hysteria. But for anything involving machine learning, I'm concerned how often the hacker mentality seems to go out the window and we instead get people advocating for it to be made illegal to host the code, for instance.
Also, notice how the legitimate use-cases 1, 3 and 4 imply the user consenting to clone their own voice, which is fine. However, the only use-case which would require cloning a specific human voice belonging to a third party, use-case 11, is "memes, satire, and parody"... and not much imagination is needed to see how steep and buttery that Teflon slippery slope is.
2, 5, 6, 9: It's true that in theory all you need is some way to capture the characteristics of a desired voice, but voice-cloning methods are the way to do this currently. If you want a voice assistant with a native accent, you fine-tune on the voice of a native speaker - opposed to turning a bunch of dials manually.
7, 8, 10: Here I think there is benefit specifically from sounding like a particular person. The dynamically generated lines of movie characters/videogame NPCs should be consistent with the actor's pre-recorded lines, for instance, and hearing someone in their own voice is more natural for communication and makes conversation easier to follow.
Pedantically, what's promoted here is a tool which features voice cloning prominently but not exclusively - other workflows demonstrated (like generating subtitles) seem mostly unobjectionable.
> Also, notice how the legitimate use-cases 1, 3 and 4 imply the user consenting to clone their own voice, which is fine
I think all, outside of potentially 8 and 11, could be done with full consent of the voice being cloned - an agreement with the movie actor to use their voice for dubbing to other languages, for example. That's already a significant number of use-cases for this tool.
> use-case 11, is "memes, satire, and parody"... and not much imagination is needed to see how steep and buttery that Teflon slippery slope is.
IMO prohibition around satire/parody would be the slippery slope, particularly with the potential for selective enforcement.
macOS devs blindly trust it like it's the App Store.
But I agree that everyone should review the Homebrew install script for any package they're installing if they're concerned about security.
If not, any recommendations for alternative projects?
P.S. Are there any tools for synthetic voice creation? Maybe melding two or more voices together, or just exploring latent space? Would be fun for character creation to create completely new voices.
Game studios will spin up a bunch of unique virtual voices for all the dialogue of extras. It'll probably be longer before we see replacements of main characters, though. There's been some research in speech-to-speech transference as well - meaning company employee A records character B's line with the appropriate emotional nuance (angry, sad, etc.) and the emotional aspect is copied on top of the generated TTS.
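For what it's worth, the open-source Coqui TTS package already ships a voice-conversion model that does roughly this kind of speech-to-speech transfer. A minimal sketch, assuming the bundled FreeVC model (the file names are placeholders, and this is certainly not what studios actually use):
```python
# Rough sketch: carry the acted performance (timing, emotion) from the source
# recording over to the target character's voice using Coqui TTS's FreeVC
# voice-conversion model. File paths are placeholders.
from TTS.api import TTS

vc = TTS("voice_conversion_models/multilingual/vctk/freevc24")
vc.voice_conversion_to_file(
    source_wav="employee_reading_line_angrily.wav",  # the acted take
    target_wav="character_reference_voice.wav",      # the voice to transfer onto
    file_path="character_line_angry.wav",
)
```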
I'm imagining it. It sucks to imagine.
I'm imagining it being used to scam people. I'm imagining it being used to leech off of performers who have worked very hard to build a recognizable voice (and it is a lot of work to speak like a performer). I'm imagining how this will be used in revenge porn. I'm imagining how this will be used to circumvent voice-controlled access systems.
This is bad. You should feel bad.
And I know you are thinking, "Wait, but I worked really hard on this!" Sorry, I appreciate that it might be technically impressive, but you've basically come out with "we've invented a device that mixes bleach and ammonia automatically in your bedroom! It's so efficient at mixing those two, we can fill a space with chlorine gas in under 10 seconds! Imagine a world where every bedroom could become a toxic site with only the push of a button."
That this is posted here, proudly, is quite frankly astoundingly embarrassing for you.
For spear-phishing (impersonate CEO, tell assistant to transfer money) it's more feasible, but I hope it forces acceptance that "somebody sounds like X over the phone" is not and has never been a good verification method - people have been falling for scams like those fake ransom calls[0] for decades.
Not that there aren't potential harms, but I think they're outweighed by positive applications. Those uncomfortable with their natural voice, such as transgender people, can communicate closer to how they wish to be perceived - or someone whose voice has been impaired (whether just a temporary cold or a permanent disorder/illness/accident) can use it from previous recordings. Privacy benefits from being able to communicate online or record videos without revealing your real voice, which I think is why many (myself included) currently resort to text-only. There's huge potential in the translation and vocal isolation aspects aiding communication - feels to me as though we're heading towards creating our own babelfish. There's also a bunch of creative applications - doing character voices for a D&D session or audiobook, memes/satire, and likely new forms of interactive media (customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.)
Scammers don't have to sound like a specific person to be helped by software like this.
I think there's also an autonomy argument to be made, if the alternative is to the effect of ensuring that people cannot use tools to hide their accent (and particularly if, as above, the intent is so they can be discriminated against based on it). Even though it isn't something we've really been able to do before, I think it's generally a person's own right to modify their voice.
https://www.voanews.com/a/illness-took-away-her-voice-ai-cre...
That being said, it does seem a bit bizarre that the repo's home page is proudly trumpeting the ability to co-opt other people's identities without their permission (and yes your unique vocal pattern is definitely part of your identity - I mean it's used in some forms of biometric data). They're doing the project a bit of a disservice.
EXACTLY. Clone the wrong person's voice and it's game over.
I don't know if this helps or harms the credibility but I can't really talk more than an hour without seriously straining my voice. So cloning it sounds like a great use-case for someone with a similar problem.
Looking forward to trying this.
> so it doesn't sound like I have offloaded the work to someone else.
So, deception. Deception that you feel is justified, but deception nonetheless.
The information I'm conveying is truthful and these are my words. The voice, generated or not, is not the thing I'm trying to convince people to believe.
Edward James Olmos if you're reading this, I'm willing to pay a license fee, but then I expect actual recordings and not just AI bullshit. I'm not pirating your voice, you're refusing to let me hire it.
Easier (and cheaper?) to just use ElevenLabs.
But if I can get the performance I want and shift it to another voice, then fully voicing free works becomes very accessible (even better would be generative AI which could take a sample of what you want and re-render it into something which sounds like a more professional performance - voice in-fill I suppose).
Is there anything new in this?
I simply asked "is there anything new in this?" because, you know, I was interested to know if there was anything new in this.
Is it not the same with this project?
Please no spoilers!
https://www.ocregister.com/2005/12/12/governors-full-stateme...
Reading over the governor's statement explaining his reasons for denial of clemency, my brain couldn't help but do so in an Arnold voice. Sometimes, to amuse friends, I would read portions of it aloud while doing the voice.
Maybe it's a bit tasteless, like the anime-girl Demon Core memes, but there's just something about hearing the legal and administrative justification for proceeding with an execution in the voice of the Terminator.
I'm the same way with famous YouTubers. If I see "Guru Larry" Bundy Jr. or Clint "LGR" Basinger leave a comment on someone else's video, my brain reads it in their voice.
I'm all for innovation, but I don't really see the use case of cloning random voices to make podcasts? Listening to Zuck interview Elon? ok...?
For example, my family's passphrase is- just kidding.
I use Coqui TTS[0] as part of my home automation. I wrote a small Python script that lets me upload a voice clip for it to clone (I got the idea from HeyWillow[1]), and a small shim that sends the output to a Home Assistant media player instead of the standard output device (a rough sketch is below, after the links). I run the TTS container on a VM with a Tesla P4 (~£100 to buy) and get about 1x-2x (roughly the same time it'd take to say it, to process) using the large model.
Just for a giggle, I uploaded a few 3-5 second clips of myself speaking and cloned my voice, then executed a command to our living room media player to call my wife into the room; from another room, she was 100% convinced it was me speaking words I'd never spoken.
I tried playing with a variety of sentences for a few hours and overall, it sounded almost exactly like me, to me, with the exception of some "attitude" and "intonation" I know I wouldn't use in my speech. I didn't notice much of an improvement using much longer clips; the short ones were "good enough".
Tangentially, it really bugs me that most phone providers in the UK now insist you record a "personal greeting" before they'll let you check your voicemail box. I just record silence, because the last thing I want/need is a voicemail greeting in my own voice confirming to some randomer I didn't want calling me who I am and that my number is active - even more so knowing that I can clone any voice to reasonably good accuracy with just a few seconds of audio.
[0] https://github.com/coqui-ai/TTS [1] https://heywillow.io/
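Roughly, the TTS call plus the Home Assistant shim boil down to something like this. A minimal sketch, not my actual setup: the HA URL, token, entity id, and file paths are placeholders, and it assumes the synthesized file is served from HA's www folder under /local/.
```python
# Minimal sketch: clone a reference voice with Coqui XTTS, then ask Home
# Assistant to play the result on a media_player entity via its REST API.
# URL, token, entity id, and paths are illustrative placeholders.
import requests
from TTS.api import TTS

HA_URL = "http://homeassistant.local:8123"     # assumed Home Assistant address
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # created in the HA user profile
SPEAKER_WAV = "/config/voices/me.wav"          # the uploaded reference clip
OUTPUT_WAV = "/config/www/tts/announce.wav"    # served by HA as /local/tts/announce.wav

def announce(text: str, entity_id: str = "media_player.living_room") -> None:
    # Clone the reference voice and write the synthesized speech to a file.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
    tts.tts_to_file(text=text, speaker_wav=SPEAKER_WAV, language="en",
                    file_path=OUTPUT_WAV)

    # Ask Home Assistant to play the file on the chosen media player
    # using the media_player.play_media service over the REST API.
    requests.post(
        f"{HA_URL}/api/services/media_player/play_media",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={
            "entity_id": entity_id,
            "media_content_id": f"{HA_URL}/local/tts/announce.wav",
            "media_content_type": "music",
        },
        timeout=30,
    )

if __name__ == "__main__":
    announce("Could you come to the living room for a minute?")
```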
Well, that's a big old fail. Just a reminder: The given (and proper) home of open source is on an open source OS.
Thanks for raising this aspect.
Btw https://github.com/haimgel/display-switch helps a lot.
https://github.com/abus-aikorea/voice-pro?tab=readme-ov-file...
clear as day, do not trust this code
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav="/path/to/target/speaker.wav",
    language="en",
)
```
Comprehensive Gradio WebUI for audio processing, powered by Whisper engines (Whisper, Faster-Whisper, Whisper-Timestamped). Features Voice Changer, zero-shot Voice Cloning (E2, F5-TTS), YouTube downloading, vocal isolation (UVR5), Text-to-Speech (Edge-TTS), and multi-language translation. Perfect for content creators and developers.
The primary goal of the voice actor is to achieve a personal connection, and I don't see how AI is a real threat to that end. I feel the same about other mediums as well. This will likely be used for scams, but I doubt it will ever draw as many eyes, or ears, as something a real human can produce. Thus, it won't be a valuable tool to marketers and will be largely unprofitable.
I wonder if certain familiar voices like that of your parents would lead to higher understanding and retention.
https://github.com/abus-aikorea/voice-pro?tab=readme-ov-file...
hard pass and anyone who reads this and continues is bonkers
https://github.com/oobabooga/text-generation-webui https://github.com/AUTOMATIC1111/stable-diffusion-webui
If you have concerns or doubts about telemetry or spyware, there are countless software options available for detection. Give it a try.
So yes, the app can certainly harm the OS, and the venv would not provide any protection against this.
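To make that concrete: a venv only isolates Python package resolution, not what the code can touch on the system. Something like this runs fine from inside any venv (the file name is just an example):
```python
# A venv changes where Python looks for packages; it is not a sandbox.
# Code inside it still runs with the full permissions of the invoking user.
import pathlib
import sys

print(sys.prefix)  # points inside the venv...

# ...but nothing stops reads/writes anywhere the user can reach:
target = pathlib.Path.home() / "written_from_inside_a_venv.txt"
target.write_text("the venv provided no protection against this")
print(f"wrote {target}")
```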