Alongside cooking for myself and walking laps around the house, Japanese cartoons (or "anime," as the kids are calling it) are something I've learned to love during quarantine.
The problem with watching anime, though, is that short of learning Japanese, you become dependent on human translators and voice actors to port the content to your language. Sometimes you get the subtitles ("subs") but not the voicing ("dubs"). Other times, entire seasons of shows aren't translated at all, and you're left on the edge of your seat with only Wikipedia summaries and 90s web forums to ferry you through the darkness.
So what are you supposed to do? The answer is obviously not to ask a computer to transcribe, translate, and voice-act entire episodes of a TV show from Japanese to English. Translation is a careful art that can't be automated and requires the loving touch of a human hand. Besides, even if you did use machine learning to translate a video, you couldn't use a computer to dub… I mean, who would want to listen to machine voices for an entire season? It'd be awful. Only a real sicko would want that.
So in this post, I'll show you how to use machine learning to transcribe, translate, and voice-act videos from one language to another, i.e. "AI-Powered Video Dubs." It won't get you Netflix-quality results, but you can use it to localize online talks and YouTube videos in a pinch. We'll start by transcribing audio to text using Google Cloud's Speech-to-Text API. Next, we'll translate that text with the Translate API. Finally, we'll "voice act" the translations using the Text-to-Speech API, which produces voices that are, according to the docs, "humanlike."
(By the way, before you flame-blast me in the comments, I should tell you that YouTube will automatically and for free transcribe and translate your videos for you. So you can treat this project like your new hobby of baking sourdough from scratch: a really inefficient use of 30 hours.)
AI-dubbed videos: Do they usually sound good?
Before you embark on this journey, you probably want to know what you have to look forward to. What quality can we realistically expect to achieve from an ML-video-dubbing pipeline?
Here's one example dubbed automatically from English to Spanish (the subtitles are also automatically generated in English). I haven't done any tuning or adjusting on it:
As you can see, the transcriptions are decent but not perfect, and the same goes for the translations. (Ignore the fact that the speaker sometimes speaks too fast; more on that later.) Overall, you can easily get the gist of what's going on from this dubbed video, but it's not exactly near human quality.
What makes this project trickier (read: more fun) than most is that there are at least three possible points of failure:
- The video can be incorrectly transcribed from audio to text by the Speech-to-Text API
- That text can be incorrectly or awkwardly translated by the Translation API
- Those translations can be mispronounced by the Text-to-Speech API
In my experience, the most successful dubbed videos were those that featured a single speaker over a clear audio stream and that were dubbed from English into another language. This is largely because the quality of transcription (Speech-to-Text) was much higher in English than in other source languages.
Dubbing from non-English languages proved substantially harder. Here's one particularly unimpressive dub from Japanese to English of one of my favorite shows, Death Note:
If you'd rather leave translation/dubbing to humans, well, I can't blame you. But if not, read on!
Building an AI Translating Dubber
As always, you can find all the code for this project in the Making with Machine Learning Github repo. To run the code yourself, follow the README to configure your credentials and enable the APIs. In this post, I'll just walk through my findings at a high level.
First, here are the steps we'll follow:
- Extract audio from video files
- Convert audio to text using the Speech-to-Text API
- **Split transcribed text into sentences/segments for translation**
- Translate text
- Generate spoken audio versions of the translated text
- **Speed up the generated audio to align with the original speaker in the video**
- Stitch the new audio on top of the original audio/video
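The first step is the most mechanical: pulling the audio track out of the video file. Here's a minimal sketch of how that might look, assuming you have `ffmpeg` installed and on your PATH (the file names are placeholders; the repo may do this differently):

```python
import subprocess

def build_extract_audio_cmd(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that strips the audio track out of a video.

    -ac 1 downmixes to mono and -ar 16000 resamples to 16 kHz, a format
    the Speech-to-Text API handles well.
    """
    return [
        "ffmpeg", "-y",      # overwrite output without asking
        "-i", video_path,    # input video
        "-vn",               # drop the video stream
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate
        audio_path,
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg to write the extracted audio to audio_path."""
    subprocess.run(build_extract_audio_cmd(video_path, audio_path), check=True)
```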
I admit that when I first set out to build this dubber, I was filled with hubris: all I had to do was plug a few APIs together, what could be easier? But as a programmer, all hubris must be punished, and boy, was I punished.
The tricky bits are the ones I bolded above, which mainly come from having to align translations with the video. But more on that in a bit.
Using the Google Cloud Speech-to-Text API
The first step in translating a video is transcribing its audio to words. To do this, I used Google Cloud's Speech-to-Text API. This tool can recognize audio spoken in 125 languages, but as I mentioned above, quality is highest in English. For our use case, we'll want to enable a couple of special features, like:
- Enhanced models. These are Speech-to-Text models that have been trained on specific data types ("video," "phone_call") and are usually higher quality. We'll use the "video" model, of course.
- Profanity filters. This flag prevents the API from returning any naughty words.
- Word time offsets. This flag tells the API that we want transcribed words returned along with the times that the speaker said them. We'll use these timestamps to help align our subtitles and dubs with the source video.
- Speech adaptation. Typically, Speech-to-Text struggles most with uncommon words or phrases. If certain words or phrases are likely to appear in your video (i.e. "gradient descent," "support vector machine"), you can pass them to the API in an array that will make them more likely to be transcribed:
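The exact client-library call is in the repo; as a rough sketch, here's what a request body enabling those features looks like in the REST API's JSON shape (the `speech:longrunningrecognize` endpoint; the hint phrases and bucket URI are illustrative):

```python
def build_recognize_request(gcs_uri: str, hint_phrases=None) -> dict:
    """Assemble a Speech-to-Text request body with the features above:
    the enhanced "video" model, profanity filtering, word time offsets,
    and speech adaptation hints."""
    return {
        "config": {
            "languageCode": "en-US",
            "model": "video",               # enhanced model trained on video audio
            "useEnhanced": True,
            "profanityFilter": True,        # mask naughty words
            "enableWordTimeOffsets": True,  # per-word timestamps for alignment
            "speechContexts": [
                {"phrases": hint_phrases or []}  # speech adaptation hints
            ],
        },
        "audio": {"uri": gcs_uri},  # audio file uploaded to Cloud Storage
    }
```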
The API returns the transcribed text along with word-level timestamps as JSON. As an example, I transcribed this video. You can see the JSON returned by the API in this gist. The output also lets us do a quick quality sanity check:
What I actually said:
“Software Developers. We’re not known for our rockin’ style, are we? Or are we? Today, I’ll show you how I used ML to make me trendier, taking inspiration from influencers.”
What the API thought I said:
“Software developers. We’re not known for our Rock and style. Are we or are we today? I’ll show you how I use ml to make new trendier taking inspiration from influencers.”
In my experience, this is about the quality you can expect when transcribing high-quality English audio. Note that the punctuation is a little off. If you're happy with viewers getting the gist of a video, this is probably good enough, although it's easy to manually correct the transcripts yourself if you speak the source language.
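For everything downstream, I found it handy to flatten the response into (word, start, end) tuples. A sketch against the JSON shape linked above (timestamps come back as strings like "1.300s"; the helper name is my own):

```python
def parse_words(response: dict) -> list:
    """Flatten a Speech-to-Text JSON response into (word, start_s, end_s) tuples."""
    def to_seconds(ts: str) -> float:
        return float(ts.rstrip("s"))  # e.g. "1.300s" -> 1.3

    words = []
    for result in response.get("results", []):
        best = result["alternatives"][0]  # top transcription hypothesis
        for w in best.get("words", []):
            words.append((w["word"], to_seconds(w["startTime"]), to_seconds(w["endTime"])))
    return words
```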
At this point, we can use the API output to generate (non-translated) subtitles. In fact, if you run my script with the `--srt` flag, it will do exactly that for you (SRT is a file type for closed captions):
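SRT is simple enough to write by hand: numbered blocks with `HH:MM:SS,mmm --> HH:MM:SS,mmm` time ranges. Here's a sketch of the formatting, not the repo's exact code, assuming segments arrive as (text, start_seconds, end_seconds) tuples:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    millis = int(round(seconds * 1000))
    h, rem = divmod(millis, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list) -> str:
    """Render (text, start_s, end_s) segments as an .srt closed-caption file."""
    blocks = []
    for i, (text, start, end) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```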
Now that we have the video transcripts, we can use the Translate API to… uh… translate them.
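The basic call itself is a single POST to the v2 REST endpoint (`https://translation.googleapis.com/language/translate/v2`). A sketch of the request body, with auth omitted:

```python
def build_translate_request(texts, source="en", target="es") -> dict:
    """Assemble a request body for the Translate API (v2 REST endpoint).

    'format': 'text' keeps the API from treating the input as HTML and
    escaping characters like apostrophes."""
    return {
        "q": list(texts),  # the API accepts a batch of strings
        "source": source,
        "target": target,
        "format": "text",
    }
```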
This is where things start to get a little 🤪.
Our objective is this: we want to be able to translate words in the original video and then play them back at roughly the same point in time, so that my "dubbed" voice is speaking in alignment with my actual voice.
The problem, though, is that translations aren't word-for-word. A sentence translated from English to Japanese may have its word order jumbled. It may contain fewer words, more words, different words, or (as is the case with idioms) completely different wording.
One way we can get around this is by translating entire sentences and then trying to align the time boundaries of those sentences. But even this becomes complicated, because how do you denote a single sentence? In English, we can split words by punctuation mark, i.e.:
But punctuation differs by language (there's no ¿ in English), and some languages don't separate sentences by punctuation marks at all.
Plus, in real-life speech, we often don't talk in complete sentences. Y'know?
Another wrinkle that makes translating transcripts difficult is that, in general, the more context you feed into a translation model, the higher quality translation you can expect. So for example, if I translate the following sentence into French:
“I’m feeling blue, but I like pink too.”
I'll get the translation:
“Je me sens bleu, mais j’aime aussi le rose.”
This is accurate. But if I split that sentence in two ("I'm feeling blue" and "But I like pink too") and translate each part separately, I get:
“Je me sens triste, mais j’aime aussi le rose”, i.e. “I’m feeling sad, but I like pink too.”
This is all to say that the more we chop up text before sending it to the Translate API, the worse quality the translations will be (though it'll be easier to temporally align them with the video).
Ultimately, the strategy I chose was to split up spoken words whenever the speaker took a greater-than-one-second pause in speaking. Here's an example of what that looked like:
This naturally led to some awkward translations (i.e. "or are we" is a weird fragment to translate), but I found it worked well enough. Here's where that logic lives in the code.
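In spirit, that chunking logic looks something like this — a sketch over per-word (word, start, end) timestamps, not the repo's exact implementation:

```python
def chunk_on_pauses(words, max_gap=1.0):
    """Group (word, start_s, end_s) tuples into segments, starting a new
    segment whenever the gap between consecutive words exceeds max_gap
    seconds. Returns (text, start_s, end_s) tuples ready for translation."""
    groups = []
    current = []
    for word, start, end in words:
        # A pause longer than max_gap closes the current segment.
        if current and start - current[-1][2] > max_gap:
            groups.append(current)
            current = []
        current.append((word, start, end))
    if current:
        groups.append(current)
    return [
        (" ".join(w for w, _, _ in g), g[0][1], g[-1][2])
        for g in groups
    ]
```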
Side bar: I also noticed that the accuracy of the timestamps returned by the Speech-to-Text API was significantly lower for non-English languages, which further decreased the quality of non-English-to-English dubbing.
And one last thing. If you already know how you want certain words to be translated (i.e. my name, "Dale," should always be translated simply to "Dale"), you can improve translation quality by taking advantage of the "glossary" feature of the Translation API Advanced. I wrote a blog post about that here.
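For reference, a glossary is attached per-request in the Advanced (v3) `translateText` endpoint. A sketch of the request body, assuming a glossary resource has already been created in `us-central1` (the project and glossary IDs are placeholders):

```python
def build_glossary_translate_request(project_id, texts, glossary_id,
                                     source="en", target="es") -> dict:
    """Request body for Translation API Advanced (v3 translateText)
    with a pre-created glossary attached."""
    parent = f"projects/{project_id}/locations/us-central1"
    return {
        "contents": list(texts),
        "sourceLanguageCode": source,
        "targetLanguageCode": target,
        "glossaryConfig": {
            # Full resource name of the glossary to apply.
            "glossary": f"{parent}/glossaries/{glossary_id}",
        },
    }
```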
The Media Translation API
As it happens, Google Cloud is working on a new API to handle exactly the problem of translating spoken words. It's called the Media Translation API, and it runs translation on audio directly (i.e. no transcribed text intermediary). I wasn't able to use that API in this project because it doesn't yet return timestamps (the tool is currently in beta), but I think it'd be great to use in future iterations!
Now for the fun bit: picking out computer voices! If you've checked out my PDF-to-Audiobook converter, you know that I love me a funny-sounding computer voice. To generate audio for dubbing, I used the Google Cloud Text-to-Speech API. The TTS API can generate lots of different voices in different languages with different accents, which you can find and play with here. The "Standard" voices might sound a bit, er, tinny, if you know what I mean, but the WaveNet voices, which are generated by high-quality neural networks, sound decently human.
Here I ran into another problem I didn't foresee: what if a computer voice speaks a lot slower than a video's original speaker does, so that the generated audio file is too long? Then the dubs would be impossible to align to the source video. Or, what if a translation is more verbose than the original wording, leading to the same problem?
To deal with this issue, I played around with the `speakingRate` parameter available in the Text-to-Speech API. This allows you to speed up or slow down a computer voice:
So, if it took the computer longer to speak a sentence than it did for the video's original speaker, I increased the `speakingRate` until the computer and human took up about the same amount of time.
Sound a little complicated? Here's what the code looks like:
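A sketch of the idea: synthesize once at normal speed, measure the result, then pick a `speakingRate` that squeezes the audio into the original speaker's time slot (the 0.25–4.0 clamp matches the range the API accepts; the request body follows the `text:synthesize` REST shape, and the voice name is just an example):

```python
def fitted_speaking_rate(audio_duration_s: float, slot_duration_s: float) -> float:
    """Pick a speakingRate so audio that takes audio_duration_s at rate 1.0
    fits into the original speaker's slot_duration_s."""
    rate = audio_duration_s / slot_duration_s
    return min(max(rate, 0.25), 4.0)  # the API accepts rates in [0.25, 4.0]

def build_tts_request(text: str, rate: float, voice="en-US-Wavenet-D") -> dict:
    """Request body for the Text-to-Speech REST API (text:synthesize)."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": voice[:5], "name": voice},  # e.g. "en-US"
        "audioConfig": {"audioEncoding": "MP3", "speakingRate": rate},
    }
```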
This solved the problem of aligning audio to video, but it did sometimes mean the computer speakers in my dubs were a little awkwardly fast. But that's a problem for V2.
Was it worth it?
You know the expression, "Play stupid games, win stupid prizes"? It seems like every ML project I build here is something of a labor of love, but this time, I love my stupid prize: the ability to generate an unlimited number of weird, robotic, awkward anime dubs that are sometimes kinda decent.
Check out my results here: