Businesses around the globe require fast and reliable ways to transcribe audio and video files, often in multiple languages. This content ranges from news broadcasts and call center phone interactions to job interviews, product demonstrations, and even court proceedings. The traditional transcription process is both expensive and lengthy, often involving dedicated staff or services and a high degree of manual effort. That effort is compounded when a multi-language transcript is required, often leaving customers to overdub the original content with a new audio track.
Natural language processing (NLP) and translation are some of the hardest problems for computers to solve because of the many contextual and idiomatic elements of speech. Historically, this has required a native speaker of both the source and target languages to perform a translation. The recently introduced neural network-based approach to machine translation has brought unprecedented translation accuracy and fluency. While it is not yet perfect, it is now a viable solution for many more use cases than ever before, including this one.
At AWS re:Invent 2017, Andy Jassy introduced new services in the AI/ML group aimed at these problems. To demonstrate some practical ways to use these services, this blog post focuses on using Amazon Transcribe, Amazon Translate, and Amazon Polly to transcribe video, translate the transcript, and produce subtitles and alternate voice tracks for the original content.
Let’s take a look at the final product, then we’ll provide details of our solution.
High-level approaches to solve the problem
Generally speaking, there are four high-level steps required to create subtitled, translated media:
1. Obtain a transcript of the content.
2. Translate the transcript into the target language.
3. Create the corresponding subtitles from the transcript or translation and create a subtitles file.
4. Combine all of those pieces into a finished product.
The following diagram illustrates the process in detail:
All of the source code files needed to perform these steps are included in this blog post and can be downloaded from here.
AWS services overview
We’ll use the following services for our use case. I’ll describe the key methods and attributes needed to solve our problem.
Amazon Transcribe
Amazon Transcribe uses advanced deep learning technologies to recognize speech in audio files or video files and transcribe them into text. The output is stored in an Amazon S3 bucket that you specify or a bucket that is managed by Amazon Transcribe. The resulting transcription is a JSON file containing the full transcription of the text that includes word-level time stamping, confidence level, and punctuation. Content creators can easily use time-stamp information to sync the transcript with existing content to produce subtitles for the videos.
Amazon Transcribe can use custom vocabularies to help improve the accuracy of the transcription. You simply need to provide the customized words, jargon, and phrases to Amazon Transcribe and tell the job to use that vocabulary using the API.
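For example, a vocabulary can be created and then referenced when starting a job. The following Boto3 sketch is illustrative; the vocabulary name, phrases, and bucket path are placeholders:

```python
import boto3

transcribe = boto3.client('transcribe')

# Register domain-specific terms as a custom vocabulary
transcribe.create_vocabulary(
    VocabularyName='my-custom-vocabulary',
    LanguageCode='en-US',
    Phrases=['DynamoDB', 'SageMaker', 'EC2']
)

# Tell a transcription job to use that vocabulary
transcribe.start_transcription_job(
    TranscriptionJobName='my-transcription-job',
    LanguageCode='en-US',
    MediaFormat='mp4',
    Media={'MediaFileUri': 'https://s3.amazonaws.com/mybucket/myvideo.mp4'},
    Settings={'VocabularyName': 'my-custom-vocabulary'}
)
```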
In this blog post, I am using the API to launch a transcription job, set a custom vocabulary, and then use the result of the job to obtain the transcript. As you’ll see, using the word-level metadata will help us to assemble the subtitles. For more information, see the Amazon Transcribe documentation.
Amazon Translate
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. It uses deep learning models to deliver more accurate and more natural-sounding translations. The Amazon Translate SDK includes TranslateText, an API action that allows us to get translations in real time.
In this blog post, I am using the Amazon Translate API to translate the text transcription into Spanish and German. For more information, see the Amazon Translate documentation.
Amazon Polly
Amazon Polly is a cloud service that leverages AI/ML technology to synthesize text into lifelike speech. You can simplify using Amazon Polly in your applications with an API tailored to your programming language or platform, such as Java, Python, Go, and others. Amazon Polly provides the ability for you to do the following:
- Synthesize speech into an audio stream, and cache and replay the speech output at no additional cost
- Choose from dozens of languages and a wide selection of natural-sounding male and female voices
- Modify your speech output using lexicons and SSML tags to control attributes such as pronunciation, volume, pitch, and speech rate
In this blog post, I am using the Amazon Polly API to voice the translation in Spanish and German. I’ll save the produced audio stream as an MP3 file for inclusion as the audio track in our movie. For more information, see the Amazon Polly documentation.
Preparing our development environment
To follow along with our example, you’ll need to set up an operating system environment, establish an AWS account, download and install the tools we’ll need, and, finally, install the source code.
Step 1: Establish an operating system environment
This example should run on any operating system that supports Python 2.7.15 and the ImageMagick tools. I used a Windows 10 environment running in Amazon WorkSpaces. Amazon WorkSpaces is a fully managed, secure, Desktop-as-a-Service solution powered by AWS. With Amazon WorkSpaces, you have the ability to change the virtual hardware configuration through the AWS Management Console. Early on in the process when my code was less efficient, I was running the Power configuration, but as the code stabilized, I was able to reconfigure to use the Performance configuration.
Step 2: Establish an AWS account
You’ll need an Amazon Web Services account to be able to use the AWS services. The process for establishing an account can be found at https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/.
Step 3: Download and Install Python 2.7.15
Python is a programming language used by many developers for scripting batch programs, real-time programs, big data, artificial intelligence, advanced analytics, and so on. A number of versions are available; in this example, I used Python 2.7.15. Python can be downloaded from http://python.org for your target operating system environment. The process for downloading and installing depends on your operating system, so follow the installation instructions found on the python.org site.
After Python is successfully installed, you can enter the following command to validate that it is correctly installed:
You should see something like the following:
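```
python --version
Python 2.7.15
```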
Step 5: Configuring the AWS CLI
Boto3, the AWS SDK for Python, offers two ways to connect to AWS services:

Resource – This method provides a direct, object-oriented way to connect to and use an Amazon service. For example, to use Amazon S3, you would use the following code to establish the connection to the resource:

```python
import boto3

# Let's use Amazon S3
s3 = boto3.resource('s3')
```

Client – This method provides a low-level connection to an underlying service. For example, to use an Amazon Simple Queue Service (Amazon SQS) queue, you would use the following code to establish the connection to the low-level client:

```python
import boto3

# Create a low-level client with the service
sqs = boto3.client('sqs')
```
To install Boto3, enter the following command:
You should see something like the following:
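```
pip install boto3
Collecting boto3
  ...
Successfully installed boto3 botocore ...
```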
Step 8: Install ImageMagick
If it is correctly installed and configured, you should see the following:
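With a recent ImageMagick release, the version check is magick -version (older releases use convert -version):

```
magick -version
Version: ImageMagick 7.x.x ...
```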
Step 10: Test MoviePy
For testing purposes, replace the “myHolidays.mp4” file with your content. Make sure to adjust the subclip parameters if your content is not at least 60 seconds long. You can also modify the message displayed in the TextClip by changing “My Holidays 2013” to your desired message. Finally, you can modify the output video file name as required and you can also change the output type simply by specifying the extension. So, for example, instead of producing a .webm file, you can simply replace it with .mp4. See the MoviePy documentation for details.
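As a quick reference, the test script in question looks like the following, adapted from the MoviePy documentation’s getting-started example:

```python
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

# Load the source video and select a subclip from t=50s to t=60s
clip = VideoFileClip("myHolidays.mp4").subclip(50, 60)

# Generate a text clip (this step requires ImageMagick)
txt_clip = TextClip("My Holidays 2013", fontsize=70, color='white')
txt_clip = txt_clip.set_pos('center').set_duration(10)

# Overlay the text on the video and write the result
video = CompositeVideoClip([clip, txt_clip])
video.write_videofile("myHolidays_edited.webm")
```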
Transcribing video using Amazon Transcribe
Our process begins by creating a transcription job with the original content. The following code is part of the main program that drives the process. Click here for the full source code.
This snippet creates the job, then loops while the status is IN_PROGRESS. After the batch job is complete, we will get the transcript response and assign it to the variable “transcript” for further processing. If we got to the “getTranscript” call, then we have successfully transcribed the content.
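A sketch of that flow follows; createTranscribeJob, getTranscriptionJobStatus, getTranscriptURI, and getTranscript stand in for the utility functions described next:

```python
import time

# Kick off the batch transcription job
job = createTranscribeJob(region, inputBucket, mediaFile)

# Poll until the job leaves the IN_PROGRESS state
while getTranscriptionJobStatus(job) == "IN_PROGRESS":
    time.sleep(30)

# With a completed job, fetch the transcript JSON for further processing
if getTranscriptionJobStatus(job) == "COMPLETED":
    transcript = getTranscript(getTranscriptURI(job))
```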
This snippet uses a utility function I created to place all of the Transcribe API calls in one Python file. The functions that follow leverage the Boto3 SDK for Python to invoke the Transcribe API. Detailed information about all of the API operations used can be found in the Transcribe Documentation.
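A sketch of those utilities, using the Boto3 Transcribe client; job naming and error handling are simplified here:

```python
import boto3
import uuid
import urllib2  # Python 2.7, matching this example's environment

def createTranscribeJob(region, bucket, mediaFile):
    # Start an asynchronous transcription job against the S3 media file
    transcribe = boto3.client('transcribe', region_name=region)
    jobName = 'transcribe-' + str(uuid.uuid4())
    transcribe.start_transcription_job(
        TranscriptionJobName=jobName,
        LanguageCode='en-US',
        MediaFormat='mp4',
        Media={'MediaFileUri': 'https://s3.amazonaws.com/' + bucket + '/' + mediaFile})
    return jobName

def getTranscriptionJobStatus(jobName):
    transcribe = boto3.client('transcribe')
    job = transcribe.get_transcription_job(TranscriptionJobName=jobName)
    return job['TranscriptionJob']['TranscriptionJobStatus']

def getTranscriptURI(jobName):
    transcribe = boto3.client('transcribe')
    job = transcribe.get_transcription_job(TranscriptionJobName=jobName)
    return job['TranscriptionJob']['Transcript']['TranscriptFileUri']

def getTranscript(transcriptURI):
    # The URI is a pre-signed URL when Amazon Transcribe manages the output
    return urllib2.urlopen(transcriptURI).read()
```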
A deeper look at the response
Fetching the TranscriptFileUri returns a JSON structure that contains a wealth of information about the content. Looking at an excerpt of the beginning of the response that follows, we can see several key items:
- jobName
- accountId
- results – contains the full transcript, including the words and punctuation, as a block of text:
```
{"jobName":"MY_JOB","accountId":"1234567890","results":{"transcripts":[{"transcript":"How about language ? You know, i talked earlier about the fact that last year we launched both polly and relax, but there are so many other things the builders want to do with language…
```
It also contains a very detailed structure for each word or punctuation element. For more information on the response structure, see the API Documentation.
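Each entry in the items array looks like the following; the values shown here are representative:

```
"items": [
  {
    "start_time": "12.44",
    "end_time": "12.82",
    "alternatives": [{ "confidence": "1.0000", "content": "language" }],
    "type": "pronunciation"
  },
  {
    "alternatives": [{ "confidence": null, "content": "?" }],
    "type": "punctuation"
  }
]
```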
Notice that each word element provides a start_time and an end_time. We’ll use these in later steps to create a subtitles file that exactly aligns the subtitles with the audio track. Since Amazon Transcribe uses machine learning technologies to recognize words, it provides a confidence score for each identified word. If the content uses a lot of domain-specific terminology, using a custom vocabulary will help to improve the accuracy of the transcription. Finally, the actual word spoken is found in the “content” tag.
The last tuple shown is for punctuation. Notice that the tuple doesn’t contain timing or confidence information. This is because it is inferred, and it is assumed to be associated with the previous tuple. We’ll need to address this as we process the response to prepare the subtitles file.
Translating a transcript into another language using Amazon Translate
Amazon Translate makes translating text to and from English simple. The following code snippet shows a utility function that invokes the Translate API:
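A sketch of such a function follows; the function name is illustrative, and note that very long transcripts may need to be split to stay within the TranslateText request size limit:

```python
import boto3

translate = boto3.client('translate')

def translateTranscript(text, sourceLangCode, targetLangCode):
    # Call the TranslateText API action on the block of text
    response = translate.translate_text(
        Text=text,
        SourceLanguageCode=sourceLangCode,
        TargetLanguageCode=targetLangCode)
    return response['TranslatedText']
```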
For more details about the Translate API, see the Translate Documentation.
Creating subtitle files
Before diving into the specifics of actually creating the subtitles, there are several design considerations we must think through.
Design consideration 1: How do we build a subtitles file?
MoviePy uses a standard subtitles file format known as “SubRip Subtitle” or SRT file. SRT files have the following format:
- A sequence number, followed by a new line
- The start and end times of the subtitle, to three decimal places, followed by a new line
- The text to be displayed on the screen, followed by two new lines
Here is an example of the output from our code:
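A representative fragment; the timings and line breaks here are illustrative:

```
1
00:00:00,180 --> 00:00:03,280
How about language? You know, i talked earlier

2
00:00:03,290 --> 00:00:06,150
about the fact that last year we launched
```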
Note that the getTimeCode function is called two-thirds of the way through this snippet. Transcribe provides a start and end time based on the number of seconds that have passed since the beginning of the clip. However, the SRT format requires that time to be in an HH:MM:SS,MMM format. The following getTimeCode snippet does the conversion for us.
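A minimal version of that conversion:

```python
def getTimeCode(seconds):
    # Convert a decimal seconds value (e.g., 72.345) into HH:MM:SS,MMM
    millis = int(round((seconds % 1) * 1000))
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)
```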
Design consideration 2: What is the correct translation scope?
We have all seen overdubbed movies in which the speaker’s lips don’t line up with the audio. As we prepare subtitles, we could translate word for word, or phrase by phrase, in perfect alignment with the original content. However, that would likely lead to inaccurate, incomplete, or out-of-context translations, because each language has differing syntax, word placement, idioms, and phrasing. Also, a translation doesn’t necessarily result in the same number of words in each language, hence the out-of-sync lips problem.
To illustrate the point, I used the Amazon Translate console to translate the following English sentence into German and Spanish:
English: Differing syntax, word placement, and idioms distinguish one language from another.

Spanish: La sintaxis, la colocación de palabras y los modismos diferentes distinguen un idioma de otro.

German: Unterschiedliche Syntax, Wortplatzierung und Idiome unterscheiden eine Sprache von einer anderen.
As you can see, neither translation has the same number of words, nor are the words in the same relative position. If we were to only translate 10 words at a time instead of a whole sentence, we would not likely arrive at a translation with the correct overall meaning.
Bottom line: We are going to have to treat subtitles and the accompanying audio for translations differently than the original transcript.
Design consideration 3: Syncing up the audio and subtitles for translations
In our example, we’ll enhance the original video with translated subtitles, and we’ll add an alternate spoken foreign language audio track generated by Amazon Polly. However, that leads us to a potential problem to overcome. Unlike the original Amazon Transcribe output, we don’t know the start and end time for every word because we have only a block of translated text. Even if we translate a whole sentence at a time, that still doesn’t provide timing; it only provides the translation. Moreover, Amazon Polly will synthesize the speech at a different rate than the original speaker. Therefore, we’ll need to calculate the duration of the spoken phrases. How do we do that?
We’ll let Amazon Polly and MoviePy help us. In this example, we’ll take the following approach:
1. Using Amazon Translate, get the complete translation from the transcript.
2. Using our Python code, break up the block of translated text into phrases of approximately 10 words so that the text fits on the screen and can be read quickly enough by the viewer.
3. For each phrase, call Amazon Polly to get the spoken audio stream and write it to a temporary audio file.
4. Load the temporary audio file into an AudioClip object from MoviePy.
5. Return the duration of the AudioClip.
Now that we know the duration of each phrase, we can leverage similar logic to create the SRT file for inclusion in the final video. Note that this code differs from the transcript version in that our start time value is contained in the seconds variable, and the end time of the phrase is calculated using the getSecondsFromTranslation function. We add the result to seconds to give us the ending time for this phrase. The code looks like this:
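A sketch of that loop; writeSRTLine is an illustrative helper that emits one SRT entry using getTimeCode:

```python
seconds = 0.0

for index, phrase in enumerate(phrases, start=1):
    # Translate the phrase for display as a subtitle
    translatedText = translateTranscript(phrase, 'en', targetLangCode)

    # Determine how long Amazon Polly takes to speak the translation
    duration = getSecondsFromTranslation(phrase, targetLangCode, 'tmp.mp3')

    # Write the SRT entry: sequence number, start --> end times, then the text
    writeSRTLine(srtFile, index, seconds, seconds + duration, translatedText)

    # The next phrase starts where this one ended
    seconds += duration
```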
The code to get the duration from the phrase is below. Note that we need to translate the phrase, and request Amazon Polly to speak it to an audio stream. Then we can write the temporary audio stream to disk in order to load it as an AudioFileClip from MoviePy. After it’s loaded, we can determine the exact duration and use that to calculate the timing of the subtitles relative to the translated audio track.
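A sketch of that function; getVoiceId and writeAudio are helpers covered in the next section, and the voice mapping is illustrative:

```python
import boto3
from moviepy.editor import AudioFileClip

def getSecondsFromTranslation(sourceText, targetLangCode, audioFileName):
    # Translate the phrase, then ask Amazon Polly to speak it as an MP3 stream
    translatedText = translateTranscript(sourceText, 'en', targetLangCode)
    polly = boto3.client('polly')
    response = polly.synthesize_speech(
        Text=translatedText,
        OutputFormat='mp3',
        VoiceId=getVoiceId(targetLangCode))  # e.g., 'Penelope' (es), 'Marlene' (de)

    # Persist the stream so that MoviePy can open it
    writeAudio(audioFileName, response['AudioStream'])

    # The AudioFileClip knows the exact spoken duration
    return AudioFileClip(audioFileName).duration
```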
Creating streamed audio files from your translation using Amazon Polly
When we request Amazon Polly to synthesize text, we have several options for specifying the type, sample rate, etc., for the output. For our example, I chose an MP3 audio stream as the desired output because MoviePy can open an MP3 file as an AudioFileClip for inclusion as an alternate voice track for our final video.
After we receive a successful response from Amazon Polly, we can write the audio stream to our temporary file:
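That step looks roughly like this:

```python
from contextlib import closing

# The synthesized audio arrives as a streaming object in the response
if "AudioStream" in response:
    with closing(response["AudioStream"]) as stream:
        writeAudio(audioFileName, stream)
```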
Now we’ll write the binary bytes to disk. I’ve created this utility method to be used in preparing the SRT file and also when creating the full audio clip for the final movie.
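A sketch of that utility; the name and error handling here are illustrative:

```python
def writeAudio(output_file, stream):
    # Read the binary MP3 bytes from the Polly audio stream
    audio_bytes = stream.read()

    # Open the file for binary writing and persist the bytes to disk
    try:
        with open(output_file, "wb") as f:
            f.write(audio_bytes)
    except IOError as error:
        print(error)
```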
Using MoviePy to create a composite video
As we discussed earlier, MoviePy is a comprehensive library for video composition and production. It leverages ImageMagick and FFmpeg (among others) for key functionality in building text titles, animation, audio, and video. In our case, we use it for a few of its features:
- Read an SRT file and create a subtitles clip from it
- Read the alternate audio track created with Amazon Polly
- Composite all of the clips together (original content, subtitles, and alternate audio track) into the finished product
To accomplish these tasks, we’ll create a function called “createVideo”. We will go through the key parts here, then put it all together in the next section.
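The function begins by opening the original content; the signature here is reconstructed from the description:

```python
from moviepy.editor import (VideoFileClip, AudioFileClip, TextClip,
                            CompositeVideoClip, concatenate_videoclips)
from moviepy.video.tools.subtitles import SubtitlesClip

def createVideo(originalClipName, subtitlesFileName, outputFileName,
                alternateAudioFileName, useOriginalAudio=True):
    # Load the original content as the base video clip
    clip = VideoFileClip(originalClipName)
```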
This creates a VideoFileClip from the video file we passed in as the original content. Next, we determine whether or not we need to replace the audio portion of the clip.
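Continuing inside createVideo, that check looks something like this:

```python
    if not useOriginalAudio:
        # Load the Amazon Polly track and trim it to the video's length
        audio = AudioFileClip(alternateAudioFileName)
        audio = audio.subclip(0, clip.duration)
        audio = audio.set_duration(clip.duration)

        # Replace the original audio with the generated track
        clip = clip.set_audio(audio)
```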
If we are using the alternate audio track such as the one generated by Amazon Polly, we’ll load the audio file as an AudioFileClip. The subclip and set_duration lines are ensuring that the audio clip is only as long as the video clip. If we were writing a professional package, perhaps we would want to be more sophisticated about handling the length. For our example, this works fine. Finally, we replace the audio from the video clip with the audio clip.
Next we’ll create the subtitles.
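Still inside createVideo, a sketch of that sequence; the font settings are illustrative:

```python
    # Python lambda that renders each subtitle line as a styled TextClip
    generator = lambda txt: TextClip(txt, font='Georgia-Regular',
                                     fontsize=24, color='white')

    # Read the SRT file into a SubtitlesClip, trimmed to fit the video
    subs = SubtitlesClip(subtitlesFileName, generator)
    subs = subs.subclip(0, clip.duration - .001)
    subs = subs.set_duration(clip.duration - .001)

    # Overlay the subtitles near the bottom of the frame
    annotated_clips = [CompositeVideoClip([clip, subs.set_pos(('center', 'bottom'))])]

    # Concatenate the clips and write out the finished product
    final = concatenate_videoclips(annotated_clips)
    final.write_videofile(outputFileName, fps=clip.fps)
```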
First, we need to define a Python lambda function (not the same as an AWS Lambda function!) that will be used to create a TextClip with the supplied font information. Then, we read the SRT file to create a SubtitlesClip called subs. Since SubtitlesClip is still a somewhat experimental part of MoviePy, we need to trim it to be slightly shorter than the overall video clip. Otherwise, it may cause runtime errors.

After the subtitles are created, we create an array of subtitled clips called annotated_clips. Next, we concatenate all of the clips into one final clip. Finally, we write the subtitled video and audio out to a new video file. Voila!
Putting it all together
Now that we’ve seen the parts, let’s put it all together into a batch file and main driver program.
makevideo.bat
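A sketch of the batch file; the flag names mirror the translatevideo.py arguments below and are illustrative:

```
REM makevideo.bat - drive the end-to-end process
python translatevideo.py -region us-east-1 -inbucket mybucket -infile myvideo.mp4 ^
    -outbucket mybucket -outfilename myvideo_translated -outfiletype mp4 ^
    -outlang es de
```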
Note that in this example, I ignore the specified output bucket. This is because I wanted to demonstrate that the output can be managed by Amazon Transcribe and made available via a pre-signed URL. You can optionally output the JSON transcription to a bucket that you specify instead.
translatevideo.py
For this program, I used the argparse package to be able to drive the program with command line arguments.
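A sketch of the argument handling; the flag names are reconstructions consistent with the batch file above:

```python
import argparse

parser = argparse.ArgumentParser(
    description='Transcribe, translate, and subtitle a video')
parser.add_argument('-region', required=True, help='AWS region for the API calls')
parser.add_argument('-inbucket', required=True, help='S3 bucket holding the input video')
parser.add_argument('-infile', required=True, help='Input video file name')
parser.add_argument('-outbucket', required=True, help='S3 bucket for the output')
parser.add_argument('-outfilename', required=True, help='Base name for the output video')
parser.add_argument('-outfiletype', required=True, help='Output container, e.g., mp4')
parser.add_argument('-outlang', required=True, nargs='+',
                    help='One or more target language codes, e.g., es de')
args = parser.parse_args()
```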
Other use cases
Now that we’ve seen how to use Amazon Transcribe, Amazon Translate, Amazon Polly, and Amazon S3 to create subtitled and translated videos from a transcript generated with video content, let’s think about some other ways we may be able to apply these technologies.
- Call center transcriptions – Since Amazon Transcribe can recognize multiple voices, it can be used to identify the caller and agent in a customer service call. Not only can call centers analyze transcripts to improve service quality, they can also use the transcripts to develop subtitles for future training videos.
- Translating a podcast into another language – Ever wanted to expand your podcast listener base, but didn’t speak another language? Similar to the other use cases, use Amazon Transcribe to obtain a transcript of the podcast, then translate it using Amazon Translate, and “speak” it with Amazon Polly. Keep in mind that if your podcast contains technical jargon, you can add a custom lexicon to Amazon Polly to improve the pronunciation.
Conclusion
In this blog post, we’ve demonstrated a working example using Amazon Transcribe, Amazon Translate, and Amazon Polly to solve the subtitling and translation problems. We’ve also talked through some of the key design issues you’ll face when taking this example to the next level for your own projects. With the rapid progress being made in NLP technologies, it will be fun to see how far this can go in the near future.
About the Author
Rob Dachowski is a Solutions Architect aligned with the AWS National Systems Integrator team. With many years of experience architecting, designing, and implementing large-scale systems, Rob has helped enterprise customers migrate to the AWS Cloud. As a professional musician and radio show producer, he has a keen interest in audio and video production tools.