Creating voice recognition bot web app with Amazon AWS Lex

18.07.2017, Christina Panayotova

When I took on this project, I had no idea how hard it would be every single step of the way. Amazon only recently released Lex to the general public, so information online was quite scarce when it comes to custom implementations. In all fairness, Amazon provides extensive documentation for bot creation, as well as for the integrations with Facebook Messenger, Slack and Twilio.

On the other end of the spectrum were the web voice APIs, which had so much information that it seemed even if I read for a week in a row, I wouldn’t get a grasp of it. Worst of all, the Web Consortium had changed things, so most of the examples on Stack Overflow, and most plugins, were hopelessly outdated.

However, after a lot of white hair, the demo was done and the concept of a web app with Lex was proven.

Goals

  1. Create a voice recording via your browser of choice (not sure about IE, as I work on OS X).
  2. Send the voice to Lex via the AWS SDK.
  3. Receive the answer from Lex and play it back to the user.

Pretty straightforward, right? Wrong.

I won’t ramble on about how I arrived at every single one of those steps; I will just go through them thoroughly, because they are very specific :) By the way, the order in which I actually figured them out hardly resembled this one, but if I had to do it all over again, this is how I would do it:

Steps to create the Lex Web App

Set up a test bot

The AWS Lex documentation explains how to create a bot very well, so for this example I will go with the default “Order Flowers” bot:

  1. Go to Services -> Artificial Intelligence -> Lex
  2. Click Create
  3. Choose the “Sample”: “Order flowers”
  4. Click Create

This bot has two sample utterances, “I would like to pick up flowers” and “I would like to order some flowers”, which are more than enough for testing.
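
Before touching any audio, you can sanity-check the bot with plain text straight from the browser console. Below is a sketch that assumes the Cognito credentials from the next sections are already configured:

    var lexruntime = new AWS.LexRuntime();
    lexruntime.postText({
      botName: 'OrderFlowers',
      botAlias: '$LATEST',
      userId: 'test-user', // any dummy id will do for a quick test
      inputText: 'I would like to order some flowers'
    }, function(err, data) {
      if (err) console.log(err, err.stack);
      else console.log(data.message, data.dialogState); // e.g. the prompt for the flower type
    });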

Set up an AWS Cognito account

This step might seem a bit redundant at this point, but you will use this account a bit later, so it is best to set it up while we are in the AWS console.

  1. Go to Cognito
  2. Click “Manage your User Pools”
  3. Create a new user pool; let’s say this one will be called “Lex”.
  4. Go to “Users and Groups” in the left navigation
  5. Create a new user following the prompts
  6. Go to “Federated Identities” (it is located in the header, on the left, next to the “User Pools” heading).
  7. If you don’t have one created: create it. I called mine “Lex”.
  8. I also enabled access to unauthenticated identities, which should not be done in production, but it was fine for the proof of concept I was doing
  9. It will ask you to create a new IAM role; click “Allow”
  10. Save the identity pool ID you are provided.
  11. Or, if you already have an identity pool: click on “Lex”,
  12. then click on “Edit identity pool” and copy the “Identity pool ID” somewhere. It should look like:
    us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Give your user access to Lex

Go to IAM (Services -> Security, Identity & Compliance -> IAM) and either create a new role or give one of the existing roles permissions to Lex. You can do that by going to Roles in the left side menu, clicking on one of the existing roles (or the new one), then Permissions -> Managed Policies -> Attach Policy. Start typing “Lex” in the filter and attach the “AmazonLexFullAccess” policy. Now return to Cognito, create a group, and attach the IAM role with Lex permissions to that group.
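
AmazonLexFullAccess is the quick route. If you prefer to grant only what this demo actually calls, a minimal inline policy could look like this sketch (scoping Resource down to your bot’s ARN would be tighter still):

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["lex:PostContent", "lex:PostText"],
        "Resource": "*"
      }]
    }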

Disclaimer: since this project is a proof of concept, the security is not at the level required for real use, so I recommend diving deeper into securing your app before putting it into production.

Prepare your app HTML

To prepare your app, you need a simple HTML document with the Amazon SDK script in the head:
<script src="https://sdk.amazonaws.com/js/aws-sdk-2.54.0.js"></script>

You also need two simple buttons, Start and Stop, as well as two audio tags. I use the first audio tag to play back the recorded speech command, and the second one for the Lex response:

    <button class="button" id="startBtn">START</button>
    <button class="button" id="stopBtn">STOP</button>
    <audio id="audio" controls>No support of audio tag</audio>
    <audio id="audioResponse" controls>No support of audio tag</audio>

Also include the main.js script file.
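
Put together, the whole page can be as minimal as this sketch (the main.js file name is just my choice):

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>Lex voice demo</title>
      <script src="https://sdk.amazonaws.com/js/aws-sdk-2.54.0.js"></script>
    </head>
    <body>
      <button class="button" id="startBtn">START</button>
      <button class="button" id="stopBtn">STOP</button>
      <audio id="audio" controls>No support of audio tag</audio>
      <audio id="audioResponse" controls>No support of audio tag</audio>
      <script src="main.js"></script>
    </body>
    </html>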

Capture browser audio

For audio capture I will be using getUserMedia together with the MediaRecorder API.

I decided to prompt the user for microphone access right away for this demo, but bear in mind that most users will feel the urge to deny mic access the first time they land on your page.
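
Not every browser ships both of these APIs, so a quick feature check avoids a silent failure on unsupported browsers. A minimal sketch:

    if (!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia) || !window.MediaRecorder) {
      console.log('getUserMedia or MediaRecorder is not supported in this browser');
    }

With that out of the way, here is the capture setup: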

var audioContext = new AudioContext(); // used by decodeAudioData below
var audio = document.getElementById('audio');

navigator.mediaDevices.getUserMedia({audio: true})
    .then(function onSuccess(stream) {

      var recorder = window.recorder = new MediaRecorder(stream);

      var data = [];
      recorder.ondataavailable = function(e) {
        data.push(e.data);
      };

      recorder.onerror = function(e) {
        throw e.error || new Error(e.name);
      };

      recorder.onstart = function(e) {
        data = [];
      };

      recorder.onstop = function(e) {

        var blobData = new Blob(data, {type: 'audio/x-l16'});

        // let the user replay the raw recording through the first audio tag
        audio.src = window.URL.createObjectURL(blobData);

        var reader = new FileReader();
        reader.onload = function() {

          // decode the Blob into an AudioBuffer, downsample to 16 kHz,
          // convert to 16-bit PCM and ship the result to Lex
          audioContext.decodeAudioData(reader.result, function(buffer) {

            reSample(buffer, 16000, function(newBuffer) {

              var arrayBuffer = convertFloat32ToInt16(newBuffer.getChannelData(0));
              sendToServer(arrayBuffer);
            });
          });
        };
        reader.readAsArrayBuffer(blobData);
      };

    })
    .catch(function onError(error) {
      console.log(error.message);
    });

So here we set up the groundwork for the events that we will trigger with the Start and Stop buttons. The handlers themselves are as simple as this:

  var startBtn = document.getElementById('startBtn');
  var stopBtn = document.getElementById('stopBtn');

  startBtn.onclick = start;
  stopBtn.onclick = stop;

  function start(){
    window.recorder.start()
  }

  function stop(){
    window.recorder.stop()
  }
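
One optional hardening step, sketched here: MediaRecorder throws an InvalidStateError if start() is called while already recording, or stop() while inactive, so guarding on recorder.state makes the buttons safe to double-click:

  function start() {
    if (window.recorder && window.recorder.state === 'inactive') {
      window.recorder.start();
    }
  }

  function stop() {
    if (window.recorder && window.recorder.state === 'recording') {
      window.recorder.stop();
    }
  }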
Why do we need ‘convertFloat32ToInt16’ and ‘reSample’

So you might have noticed those functions. They are needed because Lex expects audio/x-l16; sample-rate=16000 audio data, while the buffer we get from the MediaRecorder API has a sample rate of 44100 Hz. Even worse, we do not get the buffer right away: the audio data arrives in the form of a Blob, which has to be decoded into a buffer before we can downsample it to 16 kHz. This is exactly why we initiate a new FileReader, read the Blob into an ArrayBuffer, and decode it asynchronously with decodeAudioData. To resample and convert, we use the following functions:

function reSample(audioBuffer, targetSampleRate, onComplete) {
      var channels = audioBuffer.numberOfChannels;
      var samples = audioBuffer.length * targetSampleRate / audioBuffer.sampleRate;

      // render the original buffer through an offline context running at the
      // target sample rate; the rendered result comes back resampled
      var offlineContext = new OfflineAudioContext(channels, samples, targetSampleRate);
      var bufferSource = offlineContext.createBufferSource();
      bufferSource.buffer = audioBuffer;

      bufferSource.connect(offlineContext.destination);
      bufferSource.start(0);

      offlineContext.startRendering().then(function(renderedBuffer) {
          onComplete(renderedBuffer);
      });
  }

function convertFloat32ToInt16(buffer) {
      var l = buffer.length;
      var buf = new Int16Array(l);
      while (l--) {
          // clamp to [-1, 1] before scaling, so out-of-range samples
          // cannot overflow (and wrap around) the signed 16-bit range
          buf[l] = Math.max(-1, Math.min(1, buffer[l])) * 0x7FFF;
      }
      return buf.buffer;
  }
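
A quick way to convince yourself the conversion behaves correctly: full-scale samples should map to the edges of the signed 16-bit range.

    // expected output: Int16Array [ 32767, 0, -32767 ]
    var pcm = convertFloat32ToInt16(new Float32Array([1.0, 0.0, -1.0]));
    console.log(new Int16Array(pcm));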
Fun part: send to server

Finally, it is time to send the data we obtained and downsampled to Lex. First of all, establish your credentials; this is where the Cognito identity pool ID comes in. Set up the credentials in a variable like so:

var myCredentials = new AWS.CognitoIdentityCredentials({IdentityPoolId: 'us-east-1:xxxxxxxx-xxx-xxx-xxxx-xxxxxxxxxxxx'}),
    myConfig = new AWS.Config({
      credentials: myCredentials,
      region: 'us-east-1'
    });

// apply the config globally so that AWS.LexRuntime() below picks it up
AWS.config = myConfig;
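
Cognito credentials are fetched lazily on the first request, so an optional sketch like this resolves them up front and surfaces pool-id or permission mistakes before the first Lex call:

    myCredentials.get(function(err) {
      if (err) console.log('Cognito error:', err);
      else console.log('Got identity:', myCredentials.identityId);
    });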

The documentation for postContent is clear about what you need for the request. Required fields are “botAlias”, “botName”, “contentType”, “inputStream” and “userId”.

Disclaimer: as this is a proof-of-concept demo, I used just a dummy userId. Amazon Lex uses this to identify a user’s conversation with your bot, so it can be anything unique to your user: a name, a personal ID, a session cookie; it is up to the developer to provide one to Lex.

If your request was correct and everything went as planned, you will receive an audioStream that you need to convert to a Blob, which you can then play back to the user through the second audio tag in the HTML:

function sendToServer(audioData) {

    var params = {
      botAlias: '$LATEST', /* required */
      botName: 'OrderFlowers', /* required */
      contentType: 'audio/x-l16; sample-rate=16000; channel-count=1', /* required */
      inputStream: audioData, /* required */
      userId: 'xxxxxxxxxxxxxxxxxxxxxxxxxx', /* required */
      accept: 'audio/mpeg',
      //sessionAttributes: '' /* This value will be JSON encoded on your behalf with JSON.stringify() */
    };
    var lexruntime = new AWS.LexRuntime();
    lexruntime.postContent(params, function(err, data) {
      if (err) console.log('ERROR!', err, err.stack); // an error occurred
      else {
        // wrap the returned audio in a Blob the audio element can play
        var uInt8Array = new Uint8Array(data.audioStream);
        var blob = new Blob([uInt8Array.buffer], {type: 'audio/mpeg'});
        var audioResponse = document.getElementById('audioResponse');
        audioResponse.src = URL.createObjectURL(blob);
        audioResponse.play();
      }
    });
  }
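
For debugging, the postContent response also carries text fields alongside the audio; inside the success callback above you could log them:

        console.log('Heard:', data.inputTranscript); // what Lex understood
        console.log('Reply:', data.message);         // the bot’s textual answer
        console.log('State:', data.dialogState);     // e.g. ElicitSlot, Fulfilled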

If you don’t get any errors, you should hear Lex speaking back to you :) Please feel free to browse the code, hosted on GitHub: https://github.com/vasilevach/Web-App-with-Amazon-AWS-Lex