Web Speech API Tutorial for Developers

We will show you code snippets taken from the real code used in Speechlogger, the online speech recognition app, so you can see where and how everything is implemented in a real app. Feel free to use the snippets in your own apps; if you do, please link back to us. The API is easy to use, as it is plain JavaScript. This guide is meant to get you started in minutes, so you can get your voice recognition app up and running. For more in-depth information, see the external links at the end.

A short introduction to the API specification and this guide

The Web Speech API specification was introduced in 2012 by a W3C Community Group. Its goal is to enable modern browsers to recognize and synthesize speech. As of July 2015, Chrome is the only browser that has implemented the specification, using Google’s speech recognition engines.
As web developers we should be very happy about that, as it opens up a whole new world of opportunities for new web apps and for new interaction features in existing apps. Furthermore, since Google opened its own speech recognition engine to support the API, we can incorporate one of the best speech recognizers available. At this point, access to Google’s engine through the API is free, but there is no guarantee it will remain so.

Speech recognition in the browser - how does it work?

In short, these are the main steps to implement a speech recognizer online:
1) Tell the browser which service we want to use.
2) Ask it to start a voice listener through the API. The browser (Chrome only at this point) will capture the audio by itself, package it and send it to the service.
3) Wait for the text responses and decide what to do with them.
Let’s go step by step, accompanied by the real-life code from Speechlogger:

1. Make sure the Web Speech API is supported by the user’s browser

if (!('webkitSpeechRecognition' in window)) {
    //Speech API not supported here…
} else { //Let’s do some cool stuff :)
    var recognition = new webkitSpeechRecognition(); //That is the object that will manage our whole recognition process. 
    recognition.continuous = true;   //Suitable for dictation. 
    recognition.interimResults = true;  //If we want to start receiving results even if they are not final.
    //Define some more additional parameters for the recognition:
    recognition.lang = "en-US"; 
    recognition.maxAlternatives = 1; //In our experience, the top-ranked result is the best one.
}

Note that we set the recognition properties, including the language to recognize; "en-US" is the language code for English with a US accent. The full list of language codes can be found here.
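The support check and the settings above can be wrapped in a small helper. This is only a sketch, not Speechlogger's code: the names getRecognitionConstructor and createRecognizer are hypothetical, and the check for an unprefixed SpeechRecognition is there because that is the interface name used in the specification, while Chrome exposes the prefixed webkitSpeechRecognition. The scope parameter is normally window; it is passed in so that the helper can be tried outside a browser.

```javascript
// Returns the recognition constructor exposed by `scope`, or null if there is none.
// `scope` is normally `window` (assumption: passed in as a parameter for testability).
function getRecognitionConstructor(scope) {
    return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Creates a recognizer with the same settings used in this tutorial.
// Returns null when the browser has no speech recognition support.
function createRecognizer(scope, lang) {
    var Recognition = getRecognitionConstructor(scope);
    if (!Recognition) {
        return null; //Speech API not supported here
    }
    var recognition = new Recognition();
    recognition.continuous = true;      //Suitable for dictation
    recognition.interimResults = true;  //Receive results before they are final
    recognition.lang = lang || "en-US"; //Default to US English
    recognition.maxAlternatives = 1;
    return recognition;
}
```

In a page you would call createRecognizer(window, "en-US") and show a "not supported" message if it returns null.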

2. Define what happens as recognition starts, ends, and receives results.

This code should only run if 'webkitSpeechRecognition' is indeed supported. Once the object "recognition" is created, we can define what happens when its callback functions are fired. Here are the main functions:

recognition.onstart = function() {
    //Listening (capturing voice from audio input) started.
    //This is a good place to give the user visual feedback about that (e.g. flash a red light).
};

recognition.onend = function() {
    //Again, give the user feedback that you are not listening anymore.
    //If you wish to achieve continuous recognition, you can start the recognizer again here.
};

recognition.onresult = function(event) { //the event holds the results
    //Yay, we have results! Let’s check that they are defined and whether they are final or not:
    if (typeof(event.results) === 'undefined') { //Something is wrong…
        recognition.stop();
        return;
    }

    for (var i = event.resultIndex; i < event.results.length; ++i) {      
        if (event.results[i].isFinal) { //Final results
            console.log("final results: " + event.results[i][0].transcript);   //Of course – here is the place to do useful things with the results.
        } else {   //i.e. interim...
            console.log("interim results: " + event.results[i][0].transcript);  //You can use these results to give the user near real time experience.
        } 
    } //end for loop
};
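As one way of doing something useful with the results, the loop above can be turned into a function that separates final text from interim text. This is a minimal sketch: the name collectTranscripts and the shape of the returned object are assumptions made for illustration, not part of the API. It reads only event.resultIndex, event.results, isFinal and transcript, the same properties used in the handler above.

```javascript
// Splits the results of one 'result' event into final and interim text.
// Only results from event.resultIndex onward are read, as in the loop above.
function collectTranscripts(event) {
    var finalText = "";
    var interimText = "";
    for (var i = event.resultIndex; i < event.results.length; ++i) {
        var transcript = event.results[i][0].transcript; //Best alternative
        if (event.results[i].isFinal) {
            finalText += transcript;
        } else {
            interimText += transcript;
        }
    }
    return { finalText: finalText, interimText: interimText };
}
```

Inside recognition.onresult you could append finalText to the text you keep, and display interimText separately (for example in a lighter color) since it may still change.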

There are additional callback functions, which we will ignore for now, as they are optional. We will introduce them later.

3. All that is left is to actually start listening

Now that the recognition object is fully defined, including its parameters and callback functions, we are ready to start listening. For that we need a button that starts the listener when the user clicks it.

<div onclick="startButton(event);"><img alt="Start" id="start_img" src="https://speechlogger.appspot.com/images/micoff2.png"></div>

And our called function will start the listener:

function startButton(event) {
    recognition.start();
    //Show a slashed mic until the user allows the browser to listen and recognition actually starts. Then we’ll change the image to ‘mic on’.
    document.getElementById('start_img').src = 'https://speechlogger.appspot.com/images/micslash2.png';
}
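The image changes described above (mic off, slashed mic while waiting for permission, mic on once listening) can be tied to the recognizer's callbacks. The sketch below is an assumption about how this could be wired, not Speechlogger's code: wireMicIcon and the icons object with its off, pending and on URLs are hypothetical, and img can be any object with a src property, such as the start_img element. Note that it assigns onstart and onend, so it would replace handlers you set earlier.

```javascript
// Wires a microphone image to the state of the recognizer.
// `icons` is a hypothetical object: { off: url, pending: url, on: url }.
function wireMicIcon(recognition, img, icons) {
    recognition.onstart = function() {
        img.src = icons.on;      //Listening actually started
    };
    recognition.onend = function() {
        img.src = icons.off;     //Not listening anymore
    };
    //The returned function plays the role of startButton in the snippet above.
    return function startButton() {
        recognition.start();
        img.src = icons.pending; //Waiting for the user to allow the microphone
    };
}
```

The returned function can then be attached as the click handler of the start button.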

That’s it! You can run your speech recognizer in your own web app :)

Other optional (nice-to-have) functions:

To stop the listener before completion:

recognition.stop();
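If you restart the recognizer from onend to get continuous recognition, a deliberate stop() must not trigger another restart. One possible approach, sketched here with a hypothetical helper named makeContinuous, is to keep a flag that records whether the user still wants to listen. The helper assigns recognition.onend, so it is meant as a replacement for a handler you would otherwise write yourself.

```javascript
// Wraps a recognizer so that it restarts when a session ends on its own,
// but stays stopped after an explicit stop().
function makeContinuous(recognition) {
    var keepListening = false;
    recognition.onend = function() {
        if (keepListening) {
            recognition.start(); //The session ended by itself; start a new one
        }
    };
    return {
        start: function() {
            keepListening = true;
            recognition.start();
        },
        stop: function() {
            keepListening = false;
            recognition.stop();
        }
    };
}
```

You would then call the start and stop functions of the returned object instead of calling recognition.start() and recognition.stop() directly.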

Optional callback functions:

recognition.onspeechstart = function() {};  

recognition.onspeechend = function() {};

recognition.onnomatch = function(event) {};

recognition.onerror = function(event) {};
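For onerror, the event carries an error code in event.error. The sketch below maps the codes listed in the specification to readable messages; the function name describeSpeechError and the message wording are our own assumptions, so adapt them to your app.

```javascript
// Error codes as listed in the Web Speech API specification.
// The message texts are illustrative only.
var SPEECH_ERROR_MESSAGES = {
    "no-speech": "No speech was detected.",
    "aborted": "Speech input was aborted.",
    "audio-capture": "Audio capture failed. Is a microphone connected?",
    "network": "A network error prevented recognition.",
    "not-allowed": "Permission to use the microphone was denied.",
    "service-not-allowed": "The speech service is not allowed for this page.",
    "bad-grammar": "There was an error in the speech recognition grammar.",
    "language-not-supported": "The selected language is not supported."
};

// Returns a readable message for the event passed to recognition.onerror.
function describeSpeechError(event) {
    var code = event && event.error;
    if (Object.prototype.hasOwnProperty.call(SPEECH_ERROR_MESSAGES, code)) {
        return SPEECH_ERROR_MESSAGES[code];
    }
    return "Speech recognition error: " + code;
}
```

A handler could then be as short as recognition.onerror = function(event) { console.log(describeSpeechError(event)); };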

External links for additional information

Web Speech API Specification
Voice Driven Web Apps: Introduction to the Web Speech API – On Google Developers

Please send us your feedback. Also, if you found this guide helpful, please share it and link back to us from your apps and sites.
Good luck, and go build awesome apps!
