Speech Recognition on Roku using Google Cloud Speech-to-Text API

September 5, 2018


With Roku OS 7.6, microphone APIs were made available to developers, bringing some very capable audio-capturing features. These allow us, as developers, to integrate cloud-based APIs that process recorded audio for multiple purposes, one of which is speech recognition.

Arguably the best speech recognition technology available today is Google’s Cloud Speech-to-Text API. The API recognizes 120 languages and variants; it supports automatic punctuation, inappropriate-content filtering, and many other advanced features, and yes, the results are almost flawless. You can read all about it on their website and even try it out.

Why would we want Speech Recognition?

  • Dictation — you can enable dictation on your custom keyboards to make entering logins and passwords, or searching for content, much easier and faster.
  • Accessibility — you can significantly improve the accessibility features of your app by allowing users to navigate just with their voice.
  • Or you could just make a Flappy Bird game where instead of pressing a button every time, you get to say “Jump” (I would play that).

Prerequisites

Before we continue, there are a few things that we need to cover to be able to run and test the demo app.

Before getting started

To make things easier during this tutorial, let’s enable “Always allow” in the Roku microphone settings; otherwise, the Roku will ask for permission every time the app is side-loaded.

To enable “Always allow”, go to Settings/Privacy/Microphone/Channel microphone access and select “Always allow”.

Also, as with any Google Cloud API, the API has to be enabled on a project within the Google Cloud Console, and all API calls will be associated with that project.

Summarized steps:

  1. Create a project (or use an existing one) in the Cloud Console.
  2. Make sure that billing is enabled for your project.
  3. Enable the Speech-to-Text API.
  4. Create an API key.

Show me the code

For this demo, we’ll make a very simple app with just a label that displays the transcript text, representing the words the user spoke.

To begin, let’s download the starter project from here.

Uncompress the project (or not, depending on which side-loading solution you’re using) and side-load it.
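If you prefer to script the packaging step, here is a minimal Python sketch of building a channel zip in memory. The manifest contents, `ROKU_IP`, and the developer password are placeholders you must supply; the `plugin_install` endpoint is Roku’s standard developer web installer, which requires a real device in developer mode, so the upload is shown commented out.

```python
import io
import zipfile


def package_channel(files):
    """Zip a dict of {archive_path: bytes} into an in-memory channel package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()


# Placeholder file contents -- use the real starter project files instead.
package = package_channel({
    "manifest": b"title=SpeechDemo\nmajor_version=1\nminor_version=0\nbuild_version=0\n",
    "components/AppScene.xml": b'<component name="AppScene" extends="Scene"/>',
})

# Side-loading talks to the device's developer web installer
# (HTTP digest auth, user "rokudev"); needs a real device, so commented:
# import requests
# from requests.auth import HTTPDigestAuth
# requests.post("http://ROKU_IP/plugin_install",
#               auth=HTTPDigestAuth("rokudev", "YOUR_DEV_PASSWORD"),
#               data={"mysubmit": "Install"},
#               files={"archive": ("app.zip", package)})
print(len(package))
```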

You should see just this:


Right now nothing happens.

To fix that, let’s begin by adding the brains of the operation to our project, the SpeechRecognizer component. This component will handle all the communication between our app and the Google Cloud Speech-to-Text API.

[sourcecode language="plain"]
<component name="SpeechRecognizer" extends="Task" >

    <script type="text/brightscript" uri="pkg:/components/SpeechRecognizer.brs"/>

    <interface>
        <!-- Callback listener -->
        <field id="delegate" type="node"/>

        <function name="startListening"/>
    </interface>

</component>

function startListening(params = invalid)
    m.top.functionName = "runRecognizer"
    m.top.control = "RUN"
end function

sub runRecognizer()
    mic = createObject("roMicrophone")

    ' If we can't record, there's no point in continuing
    if not mic.CanRecord()
        ? "Can't use microphone. Check that you're using a valid remote and that microphone usage is allowed."
        ' Developer must handle error here.
        return
    end if

    ' Set port to listen for microphone events https://sdkdocs.roku.com/display/sdkdoc/roMicrophoneEvent
    port = CreateObject("roMessagePort")
    mic.SetMessagePort(port)

    ' Start recording audio.
    ' When initiating recording for the first time within the app,
    ' the OS will display a popup asking the user for permission.
    mic.StartRecording()

    ' Create buffer that will contain all the captured audio data bytes
    buffer = CreateObject("roByteArray")

    ' Start loop to begin capturing the audio recording events
    while true
        ' Capture microphone event
        event = wait(0, port)
        if event.IsRecordingInfo() ' The user is holding the OK button
            info = event.GetInfo()
            buffer.append(info.sample_data)
        else ' The user released the OK button
            exit while
        end if
    end while

    if buffer.count() > 0
        ' The audio content must be a base64-encoded string representing the audio data bytes
        audioContent = buffer.ToBase64String()
        body = buildRequestBody(audioContent)
        headers = {
            "X-Goog-Api-Key": kAPIKey()
            "Content-Type": "application/json; charset=utf-8"
        }
        response = makePostRequest(kAPIUrl(), body, headers)

        ' Access transcript. Developer must handle error here.
        results = response.results
        transcript = ""
        for each result in results
            for each alternative in result.alternatives
                transcript += alternative.transcript
            end for
        end for
        m.top.delegate.callFunc("speechRecognizerDidReceiveTranscript", transcript)
    else
        ' Developer must handle error here.
    end if
end sub

function buildRequestBody(audioContent as String) as String
    bodyObject = {
        "audio": {
            "content": audioContent
        },
        "config": {
            ' Only available using the beta API: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig
            "enableAutomaticPunctuation": true,
            ' See allowed encoding types here: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig#AudioEncoding
            "encoding": "LINEAR16",
            ' See supported languages here: https://cloud.google.com/speech-to-text/docs/languages
            "languageCode": "en-US",
            ' See allowed rates here: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
            "sampleRateHertz": 16000
        }
    }

    body = FormatJson(bodyObject)
    return body
end function

' Just a function that makes a POST request
function makePostRequest(url as String, body as String, headers as Object) as Object
    request = CreateObject("roUrlTransfer")
    request.setCertificatesFile("common:/certs/ca-bundle.crt")
    request.initClientCertificates()

    port = CreateObject("roMessagePort")
    request.setPort(port)
    request.setUrl(url)
    request.setHeaders(headers)
    request.RetainBodyOnError(true)

    timeout = 20000
    request.asyncPostFromString(body)
    event = wait(timeout, port)

    response = event.getString()
    ? "ResponseCode "; event.getResponseCode()
    ? "Response "; response
    responseObject = ParseJson(response)
    return responseObject
end function

function kAPIUrl() as String
    ' Production API: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/recognize
    ' Uncomment to use the production API. `enableAutomaticPunctuation` will not work.
    ' return "https://speech.googleapis.com/v1/speech:recognize"

    ' Beta API: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/speech/recognize
    return "https://speech.googleapis.com/v1p1beta1/speech:recognize"
end function

function kAPIKey() as String
    return YOUR_API_KEY
end function
[/sourcecode]

Make sure you go through all the comments in SpeechRecognizer.brs; they explain why and how everything is set up, with references to the documentation. Also, before continuing, we have to replace YOUR_API_KEY with the actual API key generated in the steps above.
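The REST request itself is language-agnostic, so it can help to see it outside BrightScript. Here is a Python sketch that builds the same JSON body as `buildRequestBody` (one second of silent LINEAR16 audio is used as stand-in data, and the API key placeholder is yours to fill in); the actual POST is commented out since it needs a real key.

```python
import base64
import json

API_URL = "https://speech.googleapis.com/v1p1beta1/speech:recognize"


def build_request_body(raw_audio: bytes) -> str:
    """Mirror of buildRequestBody(): base64-encode the LINEAR16 samples
    and wrap them in the same RecognitionConfig the Roku app sends."""
    return json.dumps({
        "audio": {"content": base64.b64encode(raw_audio).decode("ascii")},
        "config": {
            "enableAutomaticPunctuation": True,
            "encoding": "LINEAR16",
            "languageCode": "en-US",
            "sampleRateHertz": 16000,
        },
    })


# One second of silence: 16000 samples x 2 bytes (16-bit PCM).
body = build_request_body(b"\x00\x00" * 16000)

# Sending it (needs a real API key, so shown commented):
# import urllib.request
# req = urllib.request.Request(API_URL, data=body.encode("utf-8"), headers={
#     "X-Goog-Api-Key": "YOUR_API_KEY",
#     "Content-Type": "application/json; charset=utf-8",
# })
# print(urllib.request.urlopen(req).read().decode())
```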

To use the SpeechRecognizer component, we just need to do:

[sourcecode language="plain"]
m.recognizer = CreateObject("roSGNode", "SpeechRecognizer")
m.recognizer.delegate = m.top
m.recognizer.callFunc("startListening", {})
[/sourcecode]

…and the delegate callback must also be implemented; we’ll come back to that shortly.

So let’s use the SpeechRecognizer component. Roku’s microphone is only accessible while the user is holding the “OK” button, so we must start the speech recognizer on an “OK” button press. To do that, we simply implement the onKeyEvent function and listen for an “OK” press. We will also add a helper function that only instantiates the recognizer when needed. Let’s update AppScene by adding:

[sourcecode language="plain"]
function recognizer()
    if m.recognizer = invalid
        m.recognizer = CreateObject("roSGNode", "SpeechRecognizer")
        m.recognizer.delegate = m.top
    end if
    return m.recognizer
end function

function onKeyEvent(key as String, press as Boolean) as Boolean
    if key = "OK" and press
        recognizer().callFunc("startListening", {})
        m.infoLb.text = "Listening…"
        m.label.text = ""
    end if
    return true
end function
[/sourcecode]

Now, side-load the app again, press and hold the “OK” button, and say something like “Hi Roku, can you hear me?” (you don’t have to say the punctuation marks explicitly; they are detected automatically). If everything is set up correctly, the OS should display a “listening” animation at the bottom of the screen, notifying the user that the microphone is in use.

You should also see a printout in the BrightScript console similar to this:

[sourcecode language="plain"]
ResponseCode 200
Response {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "Hi Roku, can you hear me?",
                    "confidence": 0.93763953
                }
            ]
        }
    ]
}
[/sourcecode]
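To make the response handling concrete, here is a small Python sketch that pulls the transcript out of that JSON the same way the nested for-each loops in `runRecognizer` do. Note that, like the BrightScript, it concatenates every alternative of every result; in practice you would usually take only the top alternative per result.

```python
import json

# The response printed above, as a Python dict.
sample_response = json.loads("""
{
  "results": [
    {
      "alternatives": [
        {"transcript": "Hi Roku, can you hear me?", "confidence": 0.93763953}
      ]
    }
  ]
}
""")


def extract_transcript(response: dict) -> str:
    """Concatenate the transcript of every alternative of every result,
    mirroring the nested loops in runRecognizer()."""
    transcript = ""
    for result in response.get("results", []):
        for alternative in result.get("alternatives", []):
            transcript += alternative.get("transcript", "")
    return transcript


print(extract_transcript(sample_response))  # Hi Roku, can you hear me?
```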

Do you see it? YES!! This is HUGE, we just converted speech to text on a Roku device!

Now the only thing left is to implement the SpeechRecognizer delegate callback, so that we can display the results on our label.

But first, we have to declare the callback function in AppScene.xml.

[sourcecode language="plain"]
<component name="AppScene" extends="Scene">
    <interface>
        <!-- Speech Recognizer callbacks -->
        <function name="speechRecognizerDidReceiveTranscript"/>
    </interface>
    <script type="text/brightscript" uri="pkg:/components/AppScene.brs"/>
</component>
[/sourcecode]

Then we implement that function in AppScene.brs, like so:

[sourcecode language="plain"]
function speechRecognizerDidReceiveTranscript(transcript as String)
    m.label.text = transcript
    m.infoLb.text = "Hold the OK button to start dictation, release it once you're done."
end function
[/sourcecode]

Side-load one last time, press and hold the “OK” button, and say the same phrase, or whatever you want. The label should update with what you said after you release the “OK” button.

Are you seeing the same? PERFECT!! That’s it, now you can say that you successfully “applied the most advanced deep learning neural network algorithms to audio for speech recognition with unparalleled accuracy” on a Roku. I bet you always wanted to say that.

The complete project is available here.

Some tips

  • Modern applications support real-time speech recognition, meaning that letters/words start displaying as the user speaks. It would be interesting to see how you would implement that. As a hint, Google also provides an RPC API, and that API has a method named StreamingRecognize 😉.
  • Know your limits. Speech recognition is neither cheap nor limitless, so you should check the pricing and limitations. Place and consume this feature strategically within your app; you can start by capping the time the user can record audio to 15–30 s, which should fit most apps.
  • Your app should be able to gracefully lock/restore input interactions depending on the request state.
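To put the 15–30 s cap in perspective, here is a back-of-envelope sketch of the request payload size for LINEAR16 audio at the 16 kHz sample rate used above (the helper name is ours, not part of the API):

```python
# LINEAR16 at 16 kHz is 2 bytes per sample; base64 inflates size by 4/3.
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2


def payload_bytes(seconds: float) -> int:
    """Approximate base64-encoded payload size for a recording of the given length."""
    raw = int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return (raw + 2) // 3 * 4  # base64-encoded size of `raw` bytes


print(payload_bytes(15))  # 15 s -> 480,000 raw bytes, 640,000 base64 chars
print(payload_bytes(30))  # 30 s -> 960,000 raw bytes, 1,280,000 base64 chars
```

So even a 30-second cap keeps each request around 1.3 MB, which a Roku can upload comfortably.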

That’s all for now, thanks for reading! See you next time!
