Using Google Speech API to transcribe interviews

What do London, Madrid, Seoul, Tel Aviv, Warsaw and São Paulo have in common?

They are the cities where Google has set up a Google Campus!

I recently checked in for a series of presentations. And the first one blew my mind.

Checking in for #CampusExpertsSummit @CampusSaoPaulo pic.twitter.com/jQcmpAVocx

— Christophe Buffet (@cpjfb) April 18, 2017

Onome Ofoman, a “googler” from New York, demoed a couple of Machine Learning tricks, including speech to text conversion (or speech recognition, or automatic transcription) in a few terminal commands!

So when I went back home that evening, I spent a couple of hours figuring out how the Google Cloud Platform works, set up an account, and very quickly got a result with a recording I had just done!

I've just used #ML to transcribe an interview and it blew my mind. Definitely needs some editing, but wow. Works in any language. #musetech pic.twitter.com/z8T5TrKrvH

— Christophe Buffet (@cpjfb) April 20, 2017

Transcribing interviews is incredibly helpful but equally time-consuming. For me, and for anyone who works with large amounts of audio interviews, the prospect of automating this chore is really appealing.

Google’s Speech API works in 80 languages, including a variety of local accents. I have been testing it with recordings in British and American English, French, Brazilian Portuguese, Colombian and Argentinian Spanish.

Of course, the transcriptions are not perfect, but they come back so fast that they are very useful for figuring out (even approximately) what’s on “tape”.
(Update 01/2018: now it looks like you can get timestamps too)

It is also incredibly cheap. Whereas most online transcription services charge US$1 per minute of audio, Google’s Speech API gives you 60 free minutes per month. Beyond that, up to 1 million minutes, usage is billed at $0.006 per 15 seconds. That’s $0.024 per minute, or $1.44 per hour!
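To get a feel for a monthly bill, the arithmetic above can be sketched as a small shell function. This is only a rough estimate: it assumes the rates quoted above still apply and ignores any rounding of individual requests to 15-second increments. The name estimate_cost is made up for this example.

```shell
#!/bin/bash
# Rough monthly cost in US$ for a given number of audio minutes.
# Assumes the rates quoted in this post: 60 free minutes, then
# $0.006 per 15 seconds (four increments per minute).
estimate_cost() {
    local minutes=$1
    awk -v m="$minutes" 'BEGIN {
        billable = (m > 60) ? m - 60 : 0
        printf "%.2f\n", billable * 4 * 0.006
    }'
}

estimate_cost 120   # prints 1.44 (60 billable minutes)
```

Anything up to 60 minutes comes out as 0.00.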

Here’s a summary of my findings, to quickly get running on a Mac.

1. Set up your GCP project

The process is described in detail in this Quickstart page of the documentation.

It involves a step called “Enable billing for your project”, but Google offers a 30-day free trial and discounts for non-profits (or campus resident startups).

The Quickstart makes use of the gcloud command-line tool, which is distributed as part of the Google Cloud SDK.

2. Install SoX

SoX is “the Swiss Army knife of sound processing programs”.

SoX lets you convert audio files on the command line (in the terminal) and its companion soxi lets you find a lot of information about your files.

I assume Homebrew is installed on your Mac.

brew install sox --with-lame --with-flac --with-libvorbis --with-opusfile --with-opus-tools

To be able to work with .mp3 files, we need to install the LAME codec library.

I also want to work with the Opus-encoded .ogg files that WhatsApp produces, so I add the libvorbis, opusfile and opus-tools options (I’m not quite sure what the difference is between the last two, but it works this way).

Why WhatsApp? It is so ubiquitous here in Brazil that it could be a way for museum visitors to contribute their own recordings to the audio guide using their own smartphones. The .ogg files are very small, which means they don’t eat into the user’s data plan when no wifi is available. Transcribing these contributions automatically could be a way to quickly identify the most interesting ones.

3. Convert audio files for transcription

Ideally, convert from stereo/mono WAV/AIFF files to mono FLAC files.

The Google Cloud Platform has a very good summary of the best encoding options.

Basically:

channels: convert from stereo to mono
codec: use lossless FLAC (uncompressed LINEAR16 PCM files are much bigger)
sampling rate: use 16000 Hz

Size matters as it will have an impact on your storage space. Converting from stereo to mono will diminish the file size.

However, converting from any compressed format (MP3, OGG) to lossless FLAC will increase the file size. Still, FLAC files will be smaller than LINEAR16 PCM files.

The Google Cloud Platform lists a couple more best practices regarding audio pre-processing.

It’s best to provide audio that is as clean as possible by using a good quality and well-positioned microphone. However, applying noise-reduction signal processing to the audio before sending it to the service typically reduces recognition accuracy. The service is designed to handle noisy audio.

The audio level should be calibrated so that the input signal does not clip, and peak speech audio levels reach approximately -20 to -10 dBFS.

The device should exhibit approximately “flat” amplitude versus frequency characteristics (+- 3 dB 100 Hz to 8000 Hz).

Total harmonic distortion should be less than 1% from 100 Hz to 8000 Hz at 90 dB SPL input level.

(I haven’t been that specific in my mix settings and should test further for performance and accuracy)
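One way to start testing the level recommendation is to ask SoX for the peak level of a file. Here’s a minimal sketch, assuming that the “Pk lev dB” line printed by the sox stats effect is a good enough stand-in for “peak speech audio level” (it is the peak over the whole file, so it is only an approximation). Both function names are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when a peak level in dB lies
# between -20 and -10, the range recommended above.
peak_in_range() {
    awk -v p="$1" 'BEGIN { exit !(p >= -20 && p <= -10) }'
}

# The sox "stats" effect writes its report to stderr. The first value
# on the "Pk lev dB" line is taken as the peak level of the file.
peak_level() {
    sox "$1" -n stats 2>&1 | awk '/^Pk lev dB/ { print $4 }'
}
```

With SoX installed, something like peak_in_range "$(peak_level output.flac)" && echo "level ok" would flag files that fall in the recommended range.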

4. Convert a single file

Here’s the command line to convert a (stereo or mono) WAV file to a mono FLAC file:

sox -G input.wav --channels=1 --bits=16 --rate=16000 output.flac

or, for an MP3 file:

sox -G input.mp3 --channels=1 --bits=16 --rate=16000 output.flac

(OK, it is a bit absurd to convert from the lossy MP3 format to the lossless FLAC format, like blowing up a JPEG image, but if all your audio files are MP3s it works well enough.)

The -G option guards against clipping by automatically adjusting the gain.
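To confirm that a converted file really is mono at 16000 Hz, soxi can report the channel count (-c) and the sample rate (-r). Here’s a small sketch of such a check. The function names are made up for this example, and the check only covers those two settings, not the codec.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only for mono audio sampled at 16000 Hz.
matches_settings() {
    local channels=$1 rate=$2
    [ "$channels" = "1" ] && [ "$rate" = "16000" ]
}

# Reads the channel count and sample rate of a file with soxi,
# then reports whether the file still needs converting.
check_file() {
    local file=$1
    if matches_settings "$(soxi -c "$file")" "$(soxi -r "$file")"; then
        echo "$file: ok"
    else
        echo "$file: needs conversion"
    fi
}
```

Usage would be check_file output.flac, once SoX is installed.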

5. Batch convert many files

Here’s a shell script to batch convert WAVs:

#!/bin/bash
#### description: batch converts wavs to flac for transcription
#### save as batchwavtoflac.sh
#### make executable: chmod u+x batchwavtoflac.sh
#### add to path or mv to /usr/local/bin/batchwavtoflac

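for audiofile in ./*.wav
do
    # skip the literal pattern when the folder contains no .wav file
    [ -e "$audiofile" ] || continue
    # swap the trailing .wav extension for .flac
    out="${audiofile%.wav}.flac"
    # quote "$out" so file names with spaces survive
    sox -G "$audiofile" --channels=1 --bits=16 --rate=16000 "$out"
    echo "$out"
done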

or MP3s:

#!/bin/bash
#### description: batch converts mp3s to flac for transcription
#### save as batchmp3toflac.sh
#### make executable: chmod u+x batchmp3toflac.sh
#### add to path or mv to /usr/local/bin/batchmp3toflac

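for audiofile in ./*.mp3
do
    # skip the literal pattern when the folder contains no .mp3 file
    [ -e "$audiofile" ] || continue
    # swap the trailing .mp3 extension for .flac
    out="${audiofile%.mp3}.flac"
    # quote "$out" so file names with spaces survive
    sox -G "$audiofile" --channels=1 --bits=16 --rate=16000 "$out"
    echo "$out"
done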
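The two scripts differ only in the file extension, so they could be folded into one. Here’s a possible sketch that takes the extensions as arguments. The names flac_name and convert_all are made up for this example. Note that two files sharing a base name (say talk.wav and talk.mp3) would write to the same talk.flac.

```shell
#!/bin/bash
# Hypothetical helper: swap whatever extension a path has for .flac.
flac_name() {
    printf '%s\n' "${1%.*}.flac"
}

# Convert every file in the current directory that has one of the
# given extensions, for example: convert_all wav mp3
convert_all() {
    local ext audiofile out
    for ext in "$@"; do
        for audiofile in ./*."$ext"; do
            [ -e "$audiofile" ] || continue   # the glob matched nothing
            out=$(flac_name "$audiofile")
            sox -G "$audiofile" --channels=1 --bits=16 --rate=16000 "$out"
            echo "$out"
        done
    done
}
```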

6. Copy (upload) your FLAC files to your GCP storage bucket

Files longer than 1 minute must be hosted in a Google Cloud Storage bucket in order to use asynchronous speech recognition.

You do this either with gsutil cp

gsutil cp *.flac gs://bucket/destinationfolder

or gsutil rsync

gsutil -m rsync -r localfolder gs://bucket/destinationfolder

Replace bucket with the name of your own GCP storage bucket and destinationfolder with something that makes sense for you.

7. Finally, transcribe with Google Speech API

Assuming Java is installed locally and you’ve cloned the Java Google Cloud Speech API Samples from GitHub:

java -cp /pathto/GoogleCloudPlatform/java-docs-samples/speech/cloud-client/target/speech-google-cloud-samples-1.0.0-jar-with-dependencies.jar com.example.speech.Recognize asyncrecognize gs://bucket/filetotranscribe.flac

Although you can edit and recompile the Java samples, it’s actually easier to pass parameters (like a different “language code”) to the PHP samples.

So, assuming PHP is installed locally, as well as the PHP Google Cloud Speech API Samples:

php /pathto/GoogleCloudPlatform/php-docs-samples/speech/api/speech.php transcribe gs://bucket/filetotranscribe.flac --language-code en-US --encoding FLAC --sample-rate 16000 --async

In each case, replace pathto with the path to your local installation and bucket with the name of your own GCP storage bucket.

Note the asyncrecognize or async parameters, for asynchronous speech recognition with files longer than 1 minute.
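To find out which files need the asynchronous route, soxi -D prints the duration of a file in seconds. Here’s a minimal sketch built on the one-minute limit mentioned above. The function names are made up for this example, and I treat exactly 60 seconds as still short enough.

```shell
#!/bin/bash
# Hypothetical helper: pick a recognition mode from a duration in
# seconds, using the one-minute limit described in this post.
recognition_mode() {
    awk -v d="$1" 'BEGIN { if (d > 60) print "async"; else print "sync" }'
}

# Same decision for a file, with the duration read by soxi -D.
mode_for_file() {
    recognition_mode "$(soxi -D "$1")"
}

recognition_mode 42.5     # prints sync
recognition_mode 754.31   # prints async
```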

8. You’ll get something like this

Waiting for operation to complete
Waiting for operation to complete
Waiting for operation to complete
Waiting for operation to complete
Waiting for operation to complete
Waiting for operation to complete
Array
(
    [0] => Google\Cloud\Speech\Result Object
        (
            [info:Google\Cloud\Speech\Result:private] => Array
                (
                    [alternatives] => Array
                        (
                            [0] => Array
                                (
                                    [transcript] => Rafael Santi who was born in 1483 and died in 1520 on his 37th birthday he died a terribly young man in the twenty years of his career created a revolution in painting in Italy he's usually bracketed with Leonardo and Michelangelo and yet those some how are the Artists you captivate the modern attention they have the the more dramatic stories the unfinished works Rafael however for me is the much more complete artist he wanted his works to be perfect and take a picture like the Resurrection in mass spec at this is a picture that threw his careful studies for it he brings to absolute perfection that perfectionism continue throughout his work for the two great Renaissance popes Julius II and Leo 10th in the Vatican Stan
                                    [confidence] => 0.95984364
                                )

                        )

                )

        )

    [1] => Google\Cloud\Speech\Result Object
        (
            [info:Google\Cloud\Speech\Result:private] => Array
                (
                    [alternatives] => Array
                        (
                            [0] => Array
                                (
                                    [transcript] => Jenny went beyond being a painter of altarpieces and a portraits and a small Madonna's to being an architect archaeologist a designer for Prince which will the new medium of the age a designer for tapestry a true Renaissance Man
                                    [confidence] => 0.938933
                                )

                        )

                )

        )

)
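Only the transcript lines matter for editing, so the rest of the dump can be filtered out. Here’s a small sketch that assumes the output keeps the layout shown above, with each transcript on a single line starting with “[transcript] =>”. The name extract_transcripts is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: read print_r-style output on standard input and
# keep only the text that follows "[transcript] =>" on matching lines.
extract_transcripts() {
    sed -n 's/^ *\[transcript\] => //p'
}
```

Assuming the PHP sample writes its result to standard output, piping the transcribe command from step 7 into extract_transcripts > transcript.txt would leave one chunk of text per line, ready for editing.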

9. And with a bit of editing

Raphael Santi who was born in 1483 and died in 1520 on his 37th birthday (he died a terribly young man) in the 20 years of his career created a revolution in painting in Italy.

He’s usually bracketed with Leonardo and with Michelangelo and yet those somehow are the artists who captivate the modern attention. They have the more dramatic stories, the unfinished works.

Raphael however for me is a much more complete artist. He wanted his works to be perfect and take a picture like the Resurrection in MASP. This is a picture that through his careful studies for it, he brings to absolute perfection.

That perfectionism continued throughout his work for the two great Renaissance Popes Julius II and Leo X in the Vatican Stanze. And he went beyond being a painter of altarpieces and of portraits and of small Madonnas to being an architect, archaeologist, a designer for prints which were the new medium of the age, a designer for tapestry, a true Renaissance Man.

Raphael, Resurrection of Christ (Ressurreição de Cristo), 1499 – 1502, MASP collection