Developing an Amazon Alexa Skill on Rails
In March, we had a hack day at TaskRabbit and I did a demo of posting a task using a borrowed new-ish (at the time) Amazon Echo via Alexa. For the first time in a year, I made a new engine that would handle all these new-fangled conversational UIs and bots and stuff.
The hack day came and went (I didn’t win) and this branch was just sitting there every time I ran the git branch command. I only have a few there. Keep it clean, people! Then I saw the Cyber Monday deals on Amazon. I decided that it had sat there long enough, so I dusted it off to try to bring it to the finish line.
I more or less started over, of course, because that’s how it goes. I thought I would document the process for anyone else on the trail.
Alexa Sessions
The Alexa API uses JSON to make requests and receive responses. Each session has a guid and (optional) user information.
The API has some cool session management tricks. You can return attributes that will also get passed back on the next request. This effectively gives you “memory” of the previous parts of a conversation. I chose not to do this because I am hoping to use the same engine for other similar interfaces. Instead I save the same stuff to a table, using the session guid as the key. In either case, it’s important to know where you’ve been and what you need to move forward.
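As a rough illustration of the table approach, here is a minimal sketch of conversation state keyed by the session ID. The `ConversationStore` class, its in-memory Hash, and the state names are hypothetical stand-ins for the real table; in a Rails app this would more likely be an ActiveRecord model.

```ruby
require "json"

# Hypothetical stand-in for a database table keyed by the Alexa session ID.
class ConversationStore
  def initialize
    @rows = {}
  end

  # Return the saved row for a session, creating a fresh one if needed.
  def fetch(session_id)
    @rows[session_id] ||= { "state" => "new", "data" => {} }
  end

  # Record the next state and merge in any newly collected data.
  def update(session_id, state:, data: {})
    row = fetch(session_id)
    row["state"] = state
    row["data"].merge!(data)
    row
  end
end

# Trimmed-down request body; a real one carries many more fields.
request = JSON.parse('{"session": {"sessionId": "session-1234"}}')
session_id = request["session"]["sessionId"]

store = ConversationStore.new
store.update(session_id, state: "awaiting_date", data: { "description" => "clean my house" })
```

Because nothing here depends on Alexa's own session attributes, the same store could back another conversational interface that supplies some kind of conversation ID.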
In our case, we want to check the box that says there has to be a linked user. Because this is checked, the Alexa App will send them through an OAuth flow on our site. So we generate a token that maps to the user in our system and Alexa stores that token in hers. Side note: it’s hard to not fully personify Alexa after talking (arguing) back and forth all week.
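On the skill side, the linked token arrives inside each request at `session.user.accessToken`, and a `LinkAccount` card prompts anyone who has not linked yet. A minimal sketch, assuming a hypothetical `TOKENS` lookup in place of wherever the OAuth tokens really live:

```ruby
# Hypothetical token-to-user mapping; a real app would query its OAuth tables.
TOKENS = { "token-abc" => "user-42" }

# Find our user for this request, or nil when the account is not linked.
def user_id_for(request)
  token = request.dig("session", "user", "accessToken")
  TOKENS[token]
end

# Response that puts a "link your account" call to action in the Alexa app.
def link_account_response
  {
    "version" => "1.0",
    "response" => {
      "outputSpeech" => { "type" => "PlainText", "text" => "Please link your account in the Alexa app." },
      "card" => { "type" => "LinkAccount" },
      "shouldEndSession" => true
    }
  }
end

linked = { "session" => { "user" => { "userId" => "amzn-1", "accessToken" => "token-abc" } } }
unlinked = { "session" => { "user" => { "userId" => "amzn-2" } } }
```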
Hello World
Alexa is given a single endpoint for a skill. It will POST the request to that route. So I added the line to the routes.rb file and sent it to a new SkillsController. It looks something like this:
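A minimal sketch of that wiring, written in plain Ruby so it runs on its own. The route path, action name, and `hello_world_response` helper are assumptions for illustration, and the response is built by hand rather than through a gem.

```ruby
require "json"

# In a Rails app the wiring would be roughly (path and action are assumptions):
#   post "/alexa", to: "skills#create"       # config/routes.rb
#   render json: hello_world_response         # SkillsController#create

# Build the smallest useful Alexa response: speak one sentence, end the session.
def hello_world_response(text = "Hello world")
  {
    "version" => "1.0",
    "response" => {
      "outputSpeech" => { "type" => "PlainText", "text" => text },
      "shouldEndSession" => true
    }
  }
end

puts JSON.pretty_generate(hello_world_response)
```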
I used the alexa_rubykit gem with some modifications to parse the request and write the response.
So how can we get the Echo on the desk to talk to the computer? It’s only 12 inches away and yet… so far! The Alexa app in the developer console has to point to a publicly accessible HTTPS site. I googled around a little bit and stumbled upon ngrok. You install ngrok and run ngrok http 3000. This gives you a public HTTPS URL that forwards to your localhost, which you can put in the developer console.
Alexa Intents
Knowing what the user said depends on the intents that you create in the developer console.
A simple example to get whatever the user said would look like this.
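A sketch of such a schema, built as a Ruby hash and printed as JSON, assuming the JSON intent schema format with an `AMAZON.LITERAL` slot. The intent name `GenericInput`, the slot name `Generic`, and the example phrases are hypothetical.

```ruby
require "json"

# One catch-all intent whose single literal slot captures whatever was said.
schema = {
  "intents" => [
    {
      "intent" => "GenericInput",
      "slots" => [{ "name" => "Generic", "type" => "AMAZON.LITERAL" }]
    }
  ]
}

# Sample utterances for a literal slot embed example text as {example|SlotName}.
utterances = [
  "GenericInput {clean my house|Generic}",
  "GenericInput {mount a television|Generic}"
]

puts JSON.pretty_generate(schema)
puts utterances
```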
You also use “utterances” to give examples of this generic input.
There are also several other helpful intents that normalize data. For example, the user can say the date and time in many ways but Amazon can normalize that and send over a known format. Other examples include commands like yes, no, cancel, and stop.
Here are the intents I ended up with:
I used the alexa_generator gem with some updates to declare these in a way that looks like routes. It also allows you to give examples, which helps generate all the files that are needed.
For example, here is my alexa.rb file.
Running a rake job I wrote will then generate the above intents JSON as well as the sample utterances for the developer console.
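To show the general idea of generating both files from one set of declarations, here is a small self-contained sketch. The `INTENTS` format and both helper methods are hypothetical and are not the alexa_generator DSL; a rake task would simply write the two strings to files.

```ruby
require "json"

# Hypothetical declarations: intent name, its slots, and example phrases.
INTENTS = [
  { name: "GenericInput", slots: { "Generic" => "AMAZON.LITERAL" },
    examples: ["{clean my house|Generic}", "{assemble furniture|Generic}"] },
  { name: "AMAZON.YesIntent", slots: {}, examples: [] },
  { name: "AMAZON.NoIntent", slots: {}, examples: [] }
]

# Build the intent schema JSON for the developer console.
def intent_schema(intents)
  list = intents.map do |intent|
    entry = { "intent" => intent[:name] }
    slots = intent[:slots].map { |name, type| { "name" => name, "type" => type } }
    entry["slots"] = slots unless slots.empty?
    entry
  end
  JSON.pretty_generate({ "intents" => list })
end

# Build the sample utterances text: one "IntentName phrase" pair per line.
def sample_utterances(intents)
  intents.flat_map do |intent|
    intent[:examples].map { |example| "#{intent[:name]} #{example}" }
  end.join("\n")
end

puts intent_schema(INTENTS)
puts sample_utterances(INTENTS)
```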
Simple Response
A simple skill would probably have one-ish intent and a few examples. It would receive those in the controller, return the response, and then end the session. We would also handle a few of the states to help the user out.
The controller might look like this:
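A sketch of that dispatch logic as plain Ruby methods that a controller action could call with the parsed request body. The `speak` and `handle` helpers, the `GenericInput` intent, and the spoken phrases are assumptions for illustration.

```ruby
# Wrap text in the Alexa response envelope.
def speak(text, end_session: true)
  {
    "version" => "1.0",
    "response" => {
      "outputSpeech" => { "type" => "PlainText", "text" => text },
      "shouldEndSession" => end_session
    }
  }
end

# Decide what to return for one parsed request body.
def handle(request)
  type = request.dig("request", "type")
  return speak("Welcome. What do you need done?", end_session: false) if type == "LaunchRequest"
  # Alexa does not speak a reply to SessionEndedRequest, so return an empty response.
  return { "version" => "1.0", "response" => {} } if type == "SessionEndedRequest"

  case request.dig("request", "intent", "name")
  when "AMAZON.HelpIntent"
    speak("Tell me what you need done, like clean my house.", end_session: false)
  when "AMAZON.StopIntent", "AMAZON.CancelIntent"
    speak("Okay, cancelled.")
  when "GenericInput" # hypothetical custom intent with a literal slot
    said = request.dig("request", "intent", "slots", "Generic", "value")
    speak("You said: #{said}")
  else
    speak("Sorry, I didn't catch that.")
  end
end
```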
Conversations
It all gets a bit more complicated when there is a back and forth conversation. At this point, I would say Alexa is not yet optimized for this use case.
For example, in our app with the shown set of intents, any one of them could come through. I could ask the user a yes/no question like “Your task is ready to book. Continue?” but the user could say “clean my house” or literally… anything. So I’d be expecting an AMAZON.YesIntent but get an AMAZON.LITERAL one. At the same time, it’s very helpful to use the built-in intents for their normalization capabilities. Otherwise, I’d have to do my own natural language stuff to know all the variations of dates and ways to cancel, etc.
So the trick of a conversation seems to be to know the state, know the related intents that are expected, and merge them together as best as possible. As noted, I store the state and the data collected in the database. In concept (in reality this is spread out over many classes), we add a case statement to the controller.
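A condensed sketch of that idea: switch on the stored state first, then on the intent that actually arrived. The state names, the `TaskDate` and `GenericInput` intents, and the prompts are hypothetical; only the `AMAZON.*` names are built-in intents.

```ruby
# Pull the first filled slot value out of whichever intent arrived.
def spoken_value(intent)
  slots = intent["slots"] || {}
  slot = slots.values.find { |s| s["value"] }
  slot && slot["value"]
end

# Advance the conversation one step. `convo` is the stored row for this
# session, e.g. { "state" => "awaiting_description", "data" => {} }.
# Mutates convo and returns the text to say next.
def advance(convo, intent)
  name = intent["name"]
  if ["AMAZON.StopIntent", "AMAZON.CancelIntent"].include?(name)
    convo["state"] = "done"
    return "Okay, cancelled."
  end

  case convo["state"]
  when "awaiting_description"
    # Accept any intent here, since free text can match something unexpected.
    text = spoken_value(intent)
    return "What do you need done?" unless text
    convo["data"]["description"] = text
    convo["state"] = "awaiting_date"
    "When do you need it?"
  when "awaiting_date"
    date = name == "TaskDate" ? spoken_value(intent) : nil
    return "Sorry, which day?" unless date
    convo["data"]["date"] = date
    convo["state"] = "awaiting_confirmation"
    "Your task is ready to book. Continue?"
  when "awaiting_confirmation"
    case name
    when "AMAZON.YesIntent"
      convo["state"] = "done"
      "Booked!"
    when "AMAZON.NoIntent"
      convo["state"] = "done"
      "Okay, I won't book it."
    else
      # Anything else is off-script for a yes/no question: ask again.
      "Please say yes or no. Continue?"
    end
  else
    "Sorry, I lost track. Let's start over."
  end
end
```

The key design choice is that an unexpected intent never crashes the flow: each state either extracts what it can from the intent or re-prompts without changing state.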
Using this pattern, you can have a decent conversation.
SDK Update Requests
There are two simple things that I think would make this a much better platform.
The first is to be able to handle conversations better. If I could include which intents I am expecting back from the thing I just asked, everything would be 10x better.
The issue can be seen when the app asks for more details about the task. Basically, it wants to get an AMAZON.LITERAL of a few sentences and write it down. I found that if the user happens to say “tomorrow” in there somewhere, it sometimes matches the Date and that’s the only data I get.
The issue is that what I’m interested in is specified globally and therefore does not have the context. If we could respond with expected intents or something to that effect, conversations would be much better.
The other feature is to be able to return links in the card. When I return a LinkAccount card in a response, there is a call to action on the card in the Alexa App to do OAuth. I would like to return text and a URL to put arbitrary things in the same spot. That way I could link the user to the task they just posted to create a more seamless experience.
Summary
Alexa development is fairly straightforward, assuming you either don’t need the OAuth provider bits or already have them set up. Most of the docs talk about a Java package, but doing it in the Rails environment was no trouble, whether with existing gems or by parsing the JSON yourself.
It’s not quite as easy for conversations but you can make it work. A few more tweaks, along with push notifications, would add a ton of value.
The TaskRabbit Skill is now published! Check it out.