Speech Recognition on Roku using Google Cloud Speech-to-Text API

In Roku OS 7.6, Microphone APIs were made available to developers, bringing some very capable audio-capturing features. This lets us developers integrate cloud-based APIs to process recorded audio for many purposes, one of them being Speech Recognition.

Arguably one of the best Speech Recognition technologies available today is Google’s Cloud Speech-to-Text API. The API recognizes 120 languages and variants and supports automatic punctuation, inappropriate content filtering, and many other advanced features. And yes, in practice the results are very accurate. You can read all about it on their website and even try it out.

Why would we want Speech Recognition?

  • Dictation: you can enable dictation on your custom keyboards to make entering logins and passwords, or searching for content, much easier and faster.
  • Accessibility: you can significantly improve the accessibility features of your app by allowing users to navigate just with their voice.
  • Or you could just make a Flappy Bird game where instead of pressing a button every time, you get to say “Jump” (I would play that).

Prerequisites

Before we continue, there are a few things that we need to cover to be able to run and test the demo app.

Before getting started

To make things easier during this tutorial, let’s enable “Always allow” in the Roku microphone settings. Otherwise, the Roku will ask for permission every time the app is side-loaded.

To enable “Always allow”, go to Settings/Privacy/Microphone/Channel microphone access/ and select “Always allow”.

Also, as with any Google Cloud API, the Speech-to-Text API has to be enabled on a project within the Google Cloud Console, and all the API calls will be associated with that project.

Summarized steps:

  1. Create a project (or use an existing one) in the Cloud Console.
  2. Make sure that billing is enabled for your project.
  3. Enable the Speech-to-Text API.
  4. Create an API key.

Show me the code

For this demo, we’ll make a very simple app with just a label that displays the transcript, that is, the words the user spoke.

To begin, let’s download the starter project from here.

Uncompress the project (or not, depending on which side-loading solution you’re using) and side-load it.

You should see just this:


Right now nothing happens.

To fix that, let’s begin by adding the brains of the operation to our project, the SpeechRecognizer component. This component will handle all the communication between our app and the Google Cloud Speech-to-Text API.

<component name="SpeechRecognizer" extends="Task" >

  <script type="text/brightscript" uri="pkg:/components/SpeechRecognizer.brs"/>

  <interface>
    <!-- Callback listener -->
    <field id="delegate" type="node"/>

    <function name="startListening"/>
  </interface>

</component>



function startListening(params = invalid)
  m.top.functionName = "runRecognizer"
  m.top.control = "RUN"
end function

sub runRecognizer()
  mic = createObject("roMicrophone")

  ' If we can't record, there's no point in continuing
  if not mic.CanRecord()
    ? "Can't use microphone. Check that you're using a valid remote and that microphone usage is allowed."
    ' Developer must handle error here.
    return
  end if

  ' Set port to listen for microphone events https://sdkdocs.roku.com/display/sdkdoc/roMicrophoneEvent
  port = CreateObject("roMessagePort")
  mic.SetMessagePort(port)

  ' Start recording audio.
  ' When initiating recording for the first time within the app,
  ' the OS will display a popup asking the user for permission.
  mic.StartRecording()

  ' Create buffer that will contain all the captured audio data bytes
  buffer = CreateObject("roByteArray")

  ' Start loop to begin capturing the audio recording events
  while true
    ' Capture microphone event
    event = wait(0, port)
    if event.IsRecordingInfo() ' The user is holding the OK button
      info = event.GetInfo()
      buffer.append(info.sample_data)
    else ' The user released the OK button
      exit while
    end if
  end while

  if buffer.Count() > 0
    ' The audio content must be a base64-encoded string representing the audio data bytes
    audioContent = buffer.ToBase64String()
    body = buildRequestBody(audioContent)
    headers = {
      "X-Goog-Api-Key": kAPIKey()
      "Content-Type": "application/json; charset=utf-8"
    }
    response = makePostRequest(kAPIUrl(), body, headers)

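    ' Access transcript. A failed request or an error response carries no results,
    ' so guard against that. The developer should still surface the error to the user.
    transcript = ""
    if response <> invalid and response.results <> invalid
      for each result in response.results
        for each alternative in result.alternatives
          transcript += alternative.transcript
        end for
      end for
    end if
    m.top.delegate.callFunc("speechRecognizerDidReceiveTranscript", transcript)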
  else
    ' Developer must handle error here.
  end if
end sub

function buildRequestBody(audioContent as String) as String
  bodyObject = {
    "audio": {
      "content": audioContent
    },
    "config": {
      ' Only available using the beta API: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig
      "enableAutomaticPunctuation": true,
      ' See allowed encoding types here: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig#AudioEncoding
      "encoding": "LINEAR16",
      ' See supported languages here: https://cloud.google.com/speech-to-text/docs/languages
      "languageCode": "en-US",
      ' See allowed rates here: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
      "sampleRateHertz": 16000
    }
  }

  body = FormatJson(bodyObject)
  return body
end function

' Makes a blocking POST request and returns the parsed JSON response,
' or invalid if the request could not be started, timed out, or returned an unparsable body.
function makePostRequest(url as String, body as String, headers as Object) as Object
  request = CreateObject("roUrlTransfer")
  request.SetCertificatesFile("common:/certs/ca-bundle.crt")
  request.InitClientCertificates()

  port = CreateObject("roMessagePort")
  request.SetMessagePort(port)
  request.SetUrl(url)
  request.SetHeaders(headers)
  request.RetainBodyOnError(true)

  timeout = 20000 ' milliseconds
  if not request.AsyncPostFromString(body) then return invalid
  event = wait(timeout, port)

  ' wait() returns invalid when the timeout expires
  if type(event) <> "roUrlEvent"
    request.AsyncCancel()
    return invalid
  end if

  response = event.GetString()
  ? "ResponseCode " + event.GetResponseCode().ToStr()
  ? "Response " + response
  return ParseJson(response)
end function

function kAPIUrl() as String
  ' Production API: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/recognize
  ' Uncomment to use production API. `enableAutomaticPunctuation` will not work.
  ' return "https://speech.googleapis.com/v1/speech:recognize"

  ' Beta API: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/speech/recognize
  return "https://speech.googleapis.com/v1p1beta1/speech:recognize"
end function

function kAPIKey() as String
  ' Replace the placeholder with your own API key, keeping the quotes
  return "YOUR_API_KEY"
end function

Make sure you go through all the comments in SpeechRecognizer.brs; they explain why and how everything is set up, with references to the documentation. Also, before continuing, we have to replace YOUR_API_KEY with the actual API key generated in the steps above.
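
The “Developer must handle error here” comments leave the failure cases open. As a minimal sketch of one way to fill them in, the helper below separates a usable transcript from a failure. The name extractTranscript and the shape of its return value are assumptions made for illustration; they are not part of the demo project. It also assumes that failed calls come back in Google's usual JSON error envelope, with an "error" object containing a "message".

```brightscript
' Sketch only: extractTranscript is a hypothetical helper, not part of the demo project.
' Returns { transcript: String, error: String }, where error is empty on success.
function extractTranscript(response as Object) as Object
  transcript = ""
  errorMessage = ""

  if response = invalid
    errorMessage = "No usable response (timeout, network failure, or unparsable body)"
  else if response.error <> invalid
    ' Assumed error envelope: { "error": { "code", "message", "status" } }
    if response.error.message <> invalid then errorMessage = response.error.message
  else if response.results <> invalid
    for each result in response.results
      ' Keep only the first alternative of each result
      if result.alternatives <> invalid and result.alternatives.Count() > 0
        transcript = transcript + result.alternatives[0].transcript
      end if
    end for
  end if

  return { transcript: transcript, error: errorMessage }
end function
```

With something like this in place, runRecognizer could call extractTranscript(response) and forward either field to the delegate. A response with no results and no error simply means nothing was recognized, so both fields stay empty.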

To use the SpeechRecognizer component, we just need to do:

m.recognizer = CreateObject("roSGNode", "SpeechRecognizer")
m.recognizer.delegate = m.top
m.recognizer.callFunc("startListening", {})

…and implement the delegate callback, which we’ll come back to shortly.

So let’s use the SpeechRecognizer component. Roku’s microphone is only accessible while holding the “OK” button, so we must run the speech recognizer on an “OK” button press. To do that, we simply need to implement the onKeyEvent function and listen for an “OK” button press. We will also add a helper function that instantiates the recognizer only when needed. Let’s update AppScene by adding:

function recognizer()
  if m.recognizer = invalid
    m.recognizer = CreateObject("roSGNode", "SpeechRecognizer")
    m.recognizer.delegate = m.top
  end if
  return m.recognizer
end function

function onKeyEvent(key as String, press as Boolean) as Boolean
  if key = "OK" and press
    recognizer().callFunc("startListening", {})
    m.infoLb.text = "Listening..."
    m.label.text = ""
    return true
  end if
  ' Let every other key (Back, for example) keep its default behavior
  return false
end function

Now, side-load the app again, press and hold the “OK” button and say something, like “Hi Roku, can you hear me?” (You don’t have to say the punctuation marks explicitly; they will be detected automatically.) If everything is set up correctly, the OS should display a “listening” animation at the bottom of the screen, notifying the user that the microphone is being used.

You should also see a printout similar to this in the BrightScript console:

ResponseCode 200
Response {
  "results": [
    {
      "alternatives": [
        {
          "transcript": "Hi Roku, can you hear me?",
          "confidence": 0.93763953
        }
      ]
    }
  ]
}

Do you see it? YES!! This is HUGE, we just converted speech to text on a Roku device!

Now the only thing left is to implement the SpeechRecognizer delegate callback, so that we can display the results on our label.

But first, we have to declare the callback function in AppScene.xml.

<component name="AppScene" extends="Scene">
  <interface>
    <!-- Speech Recognizer callbacks -->
    <function name="speechRecognizerDidReceiveTranscript"/>
  </interface>
  <script type="text/brightscript" uri="pkg:/components/AppScene.brs"/>
</component>

And then we implement that function in AppScene.brs, like so:

function speechRecognizerDidReceiveTranscript(transcript as String)
  m.label.text = transcript
  m.infoLb.text = "Hold the OK button to start dictation, release it once you're done."
end function

Side-load one last time, press and hold the “OK” button, and say the same phrase or whatever you want to. The label should be updated with what you said after you release the “OK” button.

Are you seeing the same? PERFECT!! That’s it, now you can say that you successfully “applied the most advanced deep learning neural network algorithms to audio for speech recognition with unparalleled accuracy” on a Roku. I bet you always wanted to say that.
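
One thing the key handler above does not guard against is a second “OK” press arriving while the previous request is still in flight. A minimal sketch of one way to handle that follows. It assumes a hypothetical m.isBusy flag, which is not part of the starter project and would need to be set to false in AppScene's init().

```brightscript
' Sketch only: m.isBusy is a hypothetical flag, not part of the starter project.
' It is assumed to be initialized with `m.isBusy = false` in AppScene's init().
function onKeyEvent(key as String, press as Boolean) as Boolean
  if key = "OK" and press
    ' Swallow new presses until the previous transcript has arrived
    if m.isBusy then return true
    m.isBusy = true
    recognizer().callFunc("startListening", {})
    m.infoLb.text = "Listening..."
    m.label.text = ""
    return true
  end if
  return false
end function

function speechRecognizerDidReceiveTranscript(transcript as String)
  m.label.text = transcript
  m.infoLb.text = "Hold the OK button to start dictation, release it once you're done."
  ' The request finished, so accept input again
  m.isBusy = false
end function
```

Note that the flag is only cleared in the delegate callback. For this to be safe, every early-exit path in the recognizer (no microphone, empty buffer) would also have to notify the delegate, otherwise input stays locked.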

The complete project is available here.

Some tips

  • Modern applications support real-time speech recognition, meaning that letters and words start appearing as the user speaks. I would be interested in seeing how you would implement that. As a hint, Google also provides an RPC API, and that API has a method named StreamingRecognize 😉.
  • Know your limits. Speech recognition is neither cheap nor limitless, so you should check the pricing and quotas. Be strategic about where you expose this feature within your app; you can start by capping the time the user can record audio to 15-30 seconds, which should fit most apps.
  • Your app should be able to gracefully lock/restore input interactions depending on the request state.
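
The recording cap mentioned above could be sketched roughly as follows. The names recordWithCap and kMaxRecordingMs are hypothetical, the 15-second value is just the lower end of the range suggested in the tip, and the function assumes it receives a roMicrophone whose message port has already been set, as in runRecognizer.

```brightscript
' Sketch only: kMaxRecordingMs and recordWithCap are hypothetical names,
' and 15 seconds is an arbitrary choice within the suggested range.
function kMaxRecordingMs() as Integer
  return 15000
end function

' Collects microphone samples until the user releases OK or the cap is reached.
' Expects a roMicrophone that already has `port` set as its message port.
function recordWithCap(mic as Object, port as Object) as Object
  buffer = CreateObject("roByteArray")
  timer = CreateObject("roTimespan")
  timer.Mark()

  mic.StartRecording()
  while true
    ' Wake up at least every 250 ms so the cap is checked even if no event arrives
    event = wait(250, port)
    if type(event) = "roMicrophoneEvent"
      if event.IsRecordingInfo()
        buffer.Append(event.GetInfo().sample_data)
      else
        ' The user released the OK button
        exit while
      end if
    end if

    if timer.TotalMilliseconds() >= kMaxRecordingMs()
      mic.StopRecording()
      exit while
    end if
  end while

  return buffer
end function
```

For a sense of scale, with the config used above (LINEAR16 at 16000 Hz) and assuming mono audio, each second of speech is about 32,000 bytes. A 15-second cap therefore keeps the raw buffer around 480,000 bytes, or roughly 640,000 characters once base64-encoded.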

That’s all for now, thanks for reading! See you next time!