Developing an Amazon Alexa Skill on Rails

In March, we had a hack day at TaskRabbit and I did a demo of posting a task using a borrowed new-ish (at the time) Amazon Echo via Alexa. For the first time in a year, I made a new engine that would handle all these new-fangled conversational UIs and bots and stuff.

The hack day came and went (I didn’t win) and this branch was just sitting there every time I did a git branch command. I only have a few there. Keep it clean, people! Then I saw the Cyber Monday deals on Amazon. I decided that it had sat there long enough so I dusted it off to try and bring it to the finish line.

I more or less started over, of course, because that’s how it goes. I thought I would document the process for anyone else on the trail.

Alexa Sessions

The Alexa API uses JSON to make requests and receive responses. Each session has a guid and (optional) user information.
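
To make that concrete, here is a trimmed request body and a plain-Ruby sketch that pulls out the session guid and the optional user token. The payload is abbreviated and every id in it is a made-up placeholder; `summarize_request` is a hypothetical helper of mine, not something from a gem.

```ruby
require "json"

# Abbreviated example of the JSON body Alexa POSTs to a skill.
# All ids and tokens below are invented placeholders.
SAMPLE_REQUEST = <<~PAYLOAD
  {
    "version": "1.0",
    "session": {
      "new": true,
      "sessionId": "example-session-guid",
      "user": { "userId": "example-user-id", "accessToken": "example-token" }
    },
    "request": {
      "type": "IntentRequest",
      "requestId": "example-request-id",
      "intent": {
        "name": "UserInput",
        "slots": { "Generic": { "name": "Generic", "value": "clean my house" } }
      }
    }
  }
PAYLOAD

# Hypothetical helper: extract the session guid, the (optional) linked-user
# token, and the request type from a raw request body.
def summarize_request(body)
  data = JSON.parse(body)
  session = data.fetch("session", {})
  {
    session_id: session["sessionId"],
    access_token: session.dig("user", "accessToken"), # nil when no account is linked
    type: data.dig("request", "type")
  }
end
```

Note that the raw JSON says "IntentRequest"; the "INTENT_REQUEST" style names used in the controllers in this post come from the gem I used to parse it.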

The API has some cool session management tricks. You can return attributes that will also get passed back on the next request. This effectively gives you “memory” of the previous parts of a conversation. I chose not to do this because I am hoping to use the same engine for other similar interfaces. Instead, I save the same stuff to a database table using the session guid as the key. In either case, it’s important to know where you’ve been and what you need to move forward.
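
For anyone who does want the built-in “memory”, this is roughly what it looks like in plain Ruby without any gem. It is only a sketch: `build_reply` and `previous_attributes` are names I made up, and the "state" attribute is just an example of something you might store.

```ruby
require "json"

# Build a minimal Alexa response hash. Anything placed in "sessionAttributes"
# is sent back by Alexa on the next request in the same session.
def build_reply(text, attributes: {}, end_session: false)
  {
    "version" => "1.0",
    "sessionAttributes" => attributes,
    "response" => {
      "outputSpeech" => { "type" => "PlainText", "text" => text },
      "shouldEndSession" => end_session
    }
  }
end

# Read the attributes back out of the next request (already parsed from JSON).
# Returns an empty hash on the first request of a session.
def previous_attributes(request_data)
  request_data.dig("session", "attributes") || {}
end

reply = build_reply("What date and time?", attributes: { "state" => "deciding_time" })
json_body = JSON.generate(reply) # what the controller would render
```

The trade-off is the one described above: attributes live only inside the Alexa session, while a table keyed by the session guid works the same way for any other interface.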

In our case, we want to check the box that says there has to be a linked user. Because this is checked, the Alexa App will send them through an OAuth flow on our site. So we generate a token that maps to the user in our system and Alexa stores that token in hers. Side note: it’s hard to not fully personify Alexa after talking (arguing) back and forth all week.
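
If a request shows up without a token (for example, the user enabled the skill but skipped the linking step), the skill can answer with a LinkAccount card that prompts them in the Alexa App. A minimal sketch in plain Ruby, assuming the request has already been parsed from JSON; the helper name and the spoken text are my own:

```ruby
# Hypothetical helper: when the request carries no access token, ask the user
# to link their account via a LinkAccount card and end the session.
# Returns nil when a token is present, so the caller can carry on as usual.
def link_account_reply_if_needed(request_data)
  token = request_data.dig("session", "user", "accessToken")
  return nil unless token.nil? || token.empty?

  {
    "version" => "1.0",
    "response" => {
      "outputSpeech" => {
        "type" => "PlainText",
        "text" => "Please link your account in the Alexa app."
      },
      "card" => { "type" => "LinkAccount" },
      "shouldEndSession" => true
    }
  }
end
```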

Hello World

Alexa is given a single endpoint for a skill. It will POST the request to that route. So I added the line to the routes.rb file and sent it to a new SkillsController. It looks something like this:

class SkillsController < ::ActionController::Base
  def root
    output = AlexaRubykit::Response.new
    session_end = true
    output.add_speech("Hello World")
    render json: output.build_response(session_end)
  end
end

I used the alexa_rubykit gem with some modifications to parse the request and write the response.

So how can we get the Echo on the desk to talk to the computer? It’s only 12 inches away and yet… so far! The Alexa app in the developer console has to point to a publicly accessible HTTPS site. I googled around a little bit and stumbled upon ngrok. You install ngrok and run ngrok http 3000. This gives you a public HTTPS URL that forwards to your localhost, and that is the URL you put in the developer console.

Alexa Intents

Knowing what the user said depends on the intents that you define in the developer console.

A simple example to get whatever the user said would look like this.

{
  "intents": [
    {
      "intent": "UserInput",
      "slots": [
        {
          "name": "Generic",
          "type": "AMAZON.LITERAL"
        }
      ]
    }
  ]
}

You also use “utterances” to give examples of this generic input.

There are also several other helpful intents and slot types that normalize data. For example, the user can say the date and time in many ways but Amazon can normalize that and send over a known format. Other examples include commands like yes, no, cancel, and stop.
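
As I understand the built-in slot types, AMAZON.DATE arrives as an ISO-style string such as 2016-12-05 (or something less specific, like 2016-W49 for "this week"), and AMAZON.TIME arrives as either a clock time like 12:00 or a day-part code such as MO, AF, EV or NI. Here is a hedged sketch of turning those two values into a Ruby Time. The helper name and the hours chosen for each day part are assumptions for illustration only.

```ruby
# Assumed hours for the day-part codes AMAZON.TIME can return.
# The specific hours are arbitrary choices for illustration.
DAY_PART_HOURS = { "MO" => 9, "AF" => 13, "EV" => 18, "NI" => 21 }.freeze

# Hypothetical helper: combine normalized date and time slot values into a Time
# (in the server's local zone). Returns nil when either value is missing or the
# date is less specific than a single day.
def schedule_from_slots(date_value, time_value)
  date_match = /\A(\d{4})-(\d{2})-(\d{2})\z/.match(date_value.to_s)
  return nil unless date_match

  year, month, day = date_match.captures.map(&:to_i)

  if (hour = DAY_PART_HOURS[time_value])
    Time.new(year, month, day, hour, 0)
  elsif (time_match = /\A(\d{2}):(\d{2})\z/.match(time_value.to_s))
    Time.new(year, month, day, time_match[1].to_i, time_match[2].to_i)
  end
end
```

Returning nil for anything ambiguous keeps the decision simple on the calling side: either there is a usable schedule or the user gets asked again.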

Here are the intents I ended up with:

{
  "intents": [
    {
      "intent": "AMAZON.YesIntent"
    },
    {
      "intent": "AMAZON.NoIntent"
    },
    {
      "intent": "AMAZON.CancelIntent"
    },
    {
      "intent": "AMAZON.StopIntent"
    },
    {
      "intent": "TaskPost",
      "slots": [
        {
          "name": "Generic",
          "type": "AMAZON.LITERAL"
        },
        {
          "name": "ScheduleDate",
          "type": "AMAZON.DATE"
        },
        {
          "name": "ScheduleTime",
          "type": "AMAZON.TIME"
        }
      ]
    }
  ]
}

I used the alexa_generator gem with some updates to declare these in a way that looks like routes. It also allows you to give examples, which helps generate all the files that are needed.

For example, here is my alexa.rb file.

require 'alexa_generator'

module Interactive
  class AlexaModel
    def self.get
      @instance
    end

    def self.define(&block)
      @instance = AlexaGenerator::InteractionModel.build do |model|
        yield model
      end
    end
  end
end

Interactive::AlexaModel.define do |model|
  model.add_intent("AMAZON.YesIntent")
  model.add_intent("AMAZON.NoIntent")
  model.add_intent("AMAZON.CancelIntent")
  model.add_intent("AMAZON.StopIntent")

  model.add_intent(:TaskPost) do |intent|
    intent.add_slot(:Generic, "AMAZON.LITERAL") do |slot|
      slot.add_bindings(
        'find me a handyman',
        'clean my house',
        # ... many, many things here ...
        'wait in line',
      )
    end

    intent.add_slot(:ScheduleDate, "AMAZON.DATE") do |slot|
      slot.add_bindings(
        'tomorrow',
        'today',
        'this friday',
        'thursday',
      )
    end

    intent.add_slot(:ScheduleTime, "AMAZON.TIME") do |slot|
      slot.add_bindings(
        'morning',
        'afternoon',
        'evening',
        'noon',
        'six pm',
      )
    end

    intent.add_utterance_template('{Generic}')
    intent.add_utterance_template('{ScheduleDate} at {ScheduleTime}')
    intent.add_utterance_template('{ScheduleDate} {ScheduleTime}')
    intent.add_utterance_template('{ScheduleTime} {ScheduleDate}')
    intent.add_utterance_template('{ScheduleDate}')
    intent.add_utterance_template('{ScheduleTime}')
  end
end

Running a rake job I wrote will then generate the above intents JSON as well as the sample utterances for the developer console.

TaskPost {find me a handyman|Generic}
TaskPost {clean my house|Generic}
... many, many things here ...
TaskPost {wait in line|Generic}
TaskPost {ScheduleDate}
TaskPost {ScheduleDate} at {ScheduleTime}
TaskPost {ScheduleDate} {ScheduleTime}
TaskPost {ScheduleTime}
TaskPost {ScheduleTime} {ScheduleDate}
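
The rake job itself isn't shown here, and the gem does the real work, but the expansion is simple enough to sketch in plain Ruby. This is not the alexa_generator implementation, just a guess at the idea under my own names: literal slots get each example phrase inlined as {phrase|SlotName}, while the built-in slot types keep their {SlotName} placeholder. It only handles one literal slot per template.

```ruby
# Hypothetical expansion of utterance templates into sample-utterance lines.
# literal_bindings maps a literal slot name to its example phrases.
def sample_utterances(intent_name, templates, literal_bindings)
  templates.flat_map do |template|
    slot_names = template.scan(/\{(\w+)\}/).flatten
    literal = slot_names.find { |name| literal_bindings.key?(name) }

    if literal
      # One line per example phrase, with the phrase inlined into the slot.
      literal_bindings[literal].map do |phrase|
        expanded = template.sub("{#{literal}}") { "{#{phrase}|#{literal}}" }
        "#{intent_name} #{expanded}"
      end
    else
      # Built-in slot types need no examples inlined.
      ["#{intent_name} #{template}"]
    end
  end
end

lines = sample_utterances(
  "TaskPost",
  ["{Generic}", "{ScheduleDate} at {ScheduleTime}"],
  { "Generic" => ["find me a handyman", "clean my house"] }
)
# lines now holds:
#   TaskPost {find me a handyman|Generic}
#   TaskPost {clean my house|Generic}
#   TaskPost {ScheduleDate} at {ScheduleTime}
```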

Simple Response

A simple skill would probably have one-ish intent and a few examples. It would receive those in the controller, return the response, and then end the session. We would also handle a few of the states to help the user out.

The controller might look like this:

class SkillsController < ::ActionController::Base
  def root
    input = AlexaRubykit.build_request(params)
    output = AlexaRubykit::Response.new
    session_end = true
    message = "There was an error." # unknown thing happened

    case input.type
    when "LAUNCH_REQUEST"
      # user talked to our skill but did not say something matching intent
      message = "Say something and see what happens."
    when "INTENT_REQUEST"
      case input.name
      when "UserInput"
        # our custom, simple intent from above that user matched
        given = input.slots["Generic"].value
        message = "You said, #{given}."
      end
    when "SESSION_ENDED_REQUEST"
      # it's over
      message = nil
    end

    output.add_speech(message) unless message.blank?
    render json: output.build_response(session_end)
  end
end

Conversations

It all gets a bit more complicated when there is a back and forth conversation. At this point, I would say Alexa is not yet optimized for this use case.

For example, in our app with the shown set of intents, any one of them could come through. I could ask the user a yes/no question like “Your task is ready to book. Continue?” but the user could say “clean my house” or literally… anything. So I’d be expecting an AMAZON.YesIntent but get a TaskPost intent carrying an AMAZON.LITERAL slot instead. At the same time, it’s very helpful to use the built-in intents for their normalization capabilities. Otherwise, I’d have to do my own natural language stuff to know all the variations of dates and ways to cancel, etc.

So the trick of a conversation seems to be to know the state, know the related intents that are expected, and merge them together as best as possible. As noted, I store the state and the data collected in the database. In concept (in reality this is spread out over many classes), we add a case statement to the controller.

class SkillsController < ::ActionController::Base
  def root
    input = AlexaRubykit.build_request(params)
    output = AlexaRubykit::Response.new
    session_end = false # probably going to keep going
    message = "There was an error." # unknown thing happened
    session = Session.find_or_initialize_by(session_id: input.session.session_id)

    case input.type
    when "LAUNCH_REQUEST"
      # user talked to our skill but did not say something matching intent
      message = "Hi. How can we help?"
    when "INTENT_REQUEST"
      case session.state
      when "selecting_category"
        category = select_category(slot_params) # uses generic
        if category
          session.category = category
          message = "What date and time?"
          session.state = "deciding_time"
        else
          message = "Sorry, missed that. Try cleaning or handyman."
        end
      when "deciding_time"
        schedule = select_schedule(slot_params) # uses date/time
        if schedule
          session.schedule = schedule
          message = "Tell us more about it"
          session.state = "adding_details"
        else
          message = "Try things like Friday at noon."
        end
      when "adding_details" # etc
      when "confirming"
        if did_confirm?(slot_params) # uses yes
          # do it!
          message = "Your task has been booked"
          session.state = "completed"
        elsif did_exit?(slot_params) # uses no
          session.state = "canceled"
          session_end = true
        else
          message = "Ready to confirm? Say yes or no"
        end
      when "completed"      # etc
      end
    when "SESSION_ENDED_REQUEST"
      # it's over
      message = nil
      session_end = true
    end

    session.save!
    output.add_speech(message) unless message.blank?
    render json: output.build_response(session_end)
  end

  private

  def slot_params
    # returns all the intent slots
    # e.g. {"generic" => "what they said", "schedule_date" => "2016-12-05"}
    return @slot_params if @slot_params

    # `input` is a local variable in #root and is not visible here,
    # so build the request again from params
    input = AlexaRubykit.build_request(params)
    @slot_params = {}
    return @slot_params unless input.type == "INTENT_REQUEST"
    input.slots.each do |name, slot|
      key = name.underscore # category_noun, etc
      value = slot['value']
      @slot_params[key] = value
    end

    @slot_params
  end
end

Using this pattern, you can have a decent conversation.
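
One way to keep the "what am I expecting right now" part tidy is a small lookup of which slots each state cares about. This is a sketch of the idea in plain Ruby rather than what the real engine does; the state names follow the controller above, the slot keys are the underscored names that slot_params produces, and `relevant_slots` is a hypothetical helper.

```ruby
# Which slot keys each conversation state is waiting for.
EXPECTED_SLOTS = {
  "selecting_category" => %w[generic],
  "deciding_time"      => %w[schedule_date schedule_time],
  "adding_details"     => %w[generic]
}.freeze

# Hypothetical helper: keep only the slots the current state cares about,
# dropping blank values. An empty result means "re-prompt the user".
def relevant_slots(state, slot_params)
  expected = EXPECTED_SLOTS.fetch(state, [])
  slot_params.select { |key, value| expected.include?(key) && !value.to_s.empty? }
end
```

With this, the "deciding_time" branch would ignore a stray literal like "clean my house" and ask for the date again. States that wait on a yes or no, like "confirming", depend on the intent name rather than on slots, so they would still need their own check.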

SDK Update Requests

There are two simple things that I think would make this a much better platform.

The first is to be able to handle conversations better. If I could include which intents I am expecting back from the thing I just asked, everything would be 10x better.

The issue can be seen when the app asks for more details about the task. Basically, it wants to get an AMAZON.LITERAL of a few sentences and write it down. I found that if the user happens to say “tomorrow” in there somewhere, it sometimes matches the date slot and that’s the only data I get.

The issue is that what I’m interested in is specified globally and therefore does not have the context. If we could respond with expected intents or something to that effect, conversations would be much better.

The other feature is to be able to return links in the card. When I return a LinkAccount card in a response, there is a call to action on the card in the Alexa App to do OAuth. I would like to return text and a URL to put arbitrary things in the same spot. That way I could link the user to the task they just posted to create a more seamless experience.

Summary

Alexa development is fairly straightforward, assuming you either don’t need the OAuth provider bits or already have them set up. Most of the docs talk about a Java package, but doing it in the Rails environment was no trouble, whether with existing gems or by parsing the JSON yourself.

It’s not quite as easy for conversations but you can make it work. A few more tweaks, along with push notifications, would add a ton of value.

The TaskRabbit Skill is now published! Check it out.

Copyright © 2017 Brian Leonard