Architecture: Consider Kron

The last post in our architecture series discussed background processing. There is a special type of background processing that I wanted to make a quick note about. These are things that need to be done periodically or otherwise on a schedule.

In our internal speak, we call this a “kron” job. If you are familiar with cron jobs, it’s the same idea. A product manager misspelled it once and it stuck! We don’t actually use regular cron infrastructure, so the spelling nuance is helpful.

The specifics of how we implement it involve our message bus infrastructure, but I think the concept and the decisions involved apply to many other implementations.

When to use it

Let’s take the job from the previous article. The “charge an invoice 24 hours later” case is an interesting one. The system certainly supports delaying that code to run for an arbitrary time, but that’s not always the best idea.

class InvoiceChargeWorker
  include TResque::Worker
  inputs :invoice_id

  worker_lock :to_id

  def work
    return unless needed?
    invoice.charge!
  end

  def to_id
    invoice.to_id
  end

  def needed?
    invoice.pending?
  end

  def invoice
    @invoice ||= Invoice.find(invoice_id)
  end
end

# When invoice is created
InvoiceChargeWorker.enqueue_at(24.hours.from_now, invoice_id: invoice.id)

One reason would be memory. When there are a lot of invoices (woot!), we still have to store the notion of what should be done somewhere until it gets processed. In this case, the Redis instance will hold it in memory. That memory could fill up, and adding more workers won’t help because the jobs are intentionally delayed.

The second reason is stability. This is important work, and Redis could have issues and lose the data. We made everything idempotent and could recreate everything, but it would certainly be a huge hassle.

So when enqueueing something to run in the future, especially if it is important or a long time from now (more than a few minutes), we consider kron.

Batch mode

If we were going to accomplish the same things but on a schedule, the code would have to change in some way. I like the existing worker because it already has the good stuff from the last article: source of truth, knowing whether or not it still needs to be run, and mutual exclusion. When batch processing, I believe it’s also good to keep operating one item at a time when the queue count (and thus Redis memory) stays low or the risk of issues is high. Both are the case here.

To turn it into a batch processor, we need to know what needs to be processed at any given moment. This is easy to determine because we have the needed? method: it looks to be invoices in the pending state. Sometimes we need to add a state column or other piece of data to know what belongs in the batch, but in this case we are good to go.
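For clarity, that batch criterion can live in one place on the model. Here is a minimal sketch, assuming pending? is backed by a state string column (the column name is an assumption):

class Invoice < ActiveRecord::Base
  # One definition of "needs to be charged", shared by the worker's
  # needed? check and the batch query (assumes a `state` string column)
  scope :pending, -> { where(state: 'pending') }

  def pending?
    state == 'pending'
  end
end

The batch worker can then use Invoice.pending.find_each rather than repeating the where clause.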

From there we can decide if we are going to update the class as-is or make a batch worker. A batch worker is its own worker and would look like this:

class InvoiceChargeBatchWorker
  include TResque::Worker

  worker_lock :all
  queue_lock  :all

  def work
    Invoice.where(state: 'pending').find_each do |invoice|
      InvoiceChargeWorker.enqueue(invoice_id: invoice.id)
    end
  end
end

# process all pending invoices
InvoiceChargeBatchWorker.enqueue()

That’s it. Because of the worker lock on InvoiceChargeWorker and the state checking, it would be okay even if we enqueued it twice. The :all locks on the batch worker itself also prevent the batch code from running twice concurrently.

We could also stick it as a class method on the original:

class InvoiceChargeWorker
  include TResque::Worker
  inputs :invoice_id

  worker_lock :invoice_id

  def self.process_all!
    Invoice.where(state: 'pending').find_each do |invoice|
      self.enqueue(invoice_id: invoice.id)
    end
  end

  def work
    return unless needed?
    invoice.charge!
  end

  def needed?
    invoice.pending?
  end

  def invoice
    @invoice ||= Invoice.find(invoice_id)
  end
end

# process all pending invoices
InvoiceChargeWorker.process_all!

How it works

Again, in any given architecture there is probably a best way to do it. For example, maybe this is a good way to do it on top of Mesos.

The challenge is running something on a schedule. In this case, process all invoices that need to be paid. That is what regular cron is made to do. However, we do not want to run that on every box. If we did that, we would have serious race conditions and might pay an invoice twice. Rather, we want to run it once globally across the entire infrastructure or at least per service.

We could probably do this by noting in the devops setup that one of the servers is special: it should get the cron setup. We could use something like the whenever gem to say what to do, and we would only run that on one box per system. It needs to be per system because it has to know what worker to enqueue or, in general, what code to run.
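For example, a whenever schedule on that one special box might look like this sketch (the timing and command are illustrative):

# config/schedule.rb (whenever gem), deployed only to the special box
every 1.day, at: '12:05 am' do
  runner "InvoiceChargeWorker.process_all!"
end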

What we do instead is have a single service that has a process that sends out a heartbeat on the message bus. Every minute, it publishes an event that looks like this.

  # for Tue, 11 Apr 2017 00:25:00 UTC +00:00
  # epoch time: 1491870300

  QueueBus.publish("heartbeat_minutes", {
    "epoch_seconds"=>1491870300,
    "epoch_minutes"=>24864505,
    "epoch_hours"=>414408,
    "epoch_days"=>17267,
    "minute"=>25,
    "hour"=>0, 
    "day"=>11,
    "month"=>4,
    "year"=>2017,
    "yday"=>101,
    "wday"=>2
  })

The current code for the process is already checked into queue-bus and ready to use.

Resque Bus supports this using the resque-scheduler gem. It is set up by calling QueueBus.heartbeat!.

namespace :resque do
  task :setup => [:environment] do
    require 'resque_scheduler'
    require 'resque/scheduler'
    require 'tresque'

    QueueBus.heartbeat!
  end
end

This setup is automatically called every time Resque starts.

Usage

So now we can subscribe to this event to run something every minute, hour, day, Monday, month, whatever.

# every minute
subscribe "every_minute", 'bus_event_type' => 'heartbeat_minutes' do |attributes|
  InvoiceChargeWorker.process_all!
end

# every hour: 4:22, 5:22, 6:22, etc
subscribe "once_an_hour", 'bus_event_type' => 'heartbeat_minutes', 'minute' => 22 do |attributes|
  InvoiceChargeWorker.process_all!
end

# every day at 12:05 am
subscribe "once_a_day", 'bus_event_type' => 'heartbeat_minutes', 'hour' => 0, 'minute' => 5 do |attributes|
  InvoiceChargeWorker.process_all!
end

# every monday at 1:52 am
subscribe "early_monday_morning", 'bus_event_type' => 'heartbeat_minutes', 'wday' => 1, 'hour' => 1, 'minute' => 52 do |attributes|
  InvoiceChargeWorker.process_all!
end

# the 3rd of every month at 2:10 am
subscribe "once_a_month", 'bus_event_type' => 'heartbeat_minutes', 'day' => 3, 'hour' => 2, 'minute' => 10 do |attributes|
  InvoiceChargeWorker.process_all!
end

# every 5 minutes: 4:00, 4:05, 4:10, etc
subscribe "every_five_minutes", 'bus_event_type' => 'heartbeat_minutes' do |attributes|
  # if it doesn't fit the subscribe pattern, subscribe to every minute and filter in Ruby
  next unless attributes['minute'] % 5 == 0
  InvoiceChargeWorker.process_all!
end

Summary

So that is how “kron” works.

Over time, we have decided this is a much more reliable way to process items in the background when a delay is acceptable. By setting up some sort of centralized architecture for this, many services can subscribe in a way that is familiar and unsurprising. We have found a lot of value in that.

Architecture: Background Processing

So we have a bunch of models and are doing stuff with them in service objects. The next thing we might need is to process some code in the background.

Not everything can be done inline from the API request. For example, we might need to geocode a user’s postal code when they change it in their account. Or when an invoice is created, we want to charge it 24 hours later.

When working with background jobs, we default to the following practices:

  • Workers are enqueued with a dictionary of inputs
  • These inputs should be used to fetch data from the source of truth
  • Workers know how to check if they still need to run
  • Locking schemes should protect parallel execution

Enqueue

When we enqueue a worker, we have found that it’s quite helpful to always use a dictionary (hash) of key/value pairs. Resque and Sidekiq both take a list of arguments like so:

class HardWorker
  include Sidekiq::Worker
  def perform(name, count)
    # do something with name, count
  end
end

# enqueue
HardWorker.perform_async('bob', 5)

This has proved to be problematic when adding new parameters or having optional parameters. For example, if we add a new (third) input parameter, there might be jobs in the queue with the old two. When the new code gets deployed, those jobs will throw a ‘wrong number of arguments’ ArgumentError. When using a hash, we can give a missing key a default, fail gracefully, or do whatever we like on a class-by-class basis.

So to provide better change management and optional arguments, we always do it like so:

class HardWorker
  include TResque::Worker
  inputs :name, :count

  def work
    # do something with self.name, self.count
  end
end

# enqueue
HardWorker.enqueue(name: 'bob', count: 5)
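With the hash, adding a parameter later degrades gracefully. Here is a sketch of that; the locale input is hypothetical:

class HardWorker
  include TResque::Worker
  inputs :name, :count, :locale  # :locale added after jobs were already queued

  def work
    lang = locale || 'en'  # older payloads without :locale fall back to a default
    # do something with self.name, self.count, lang
  end
end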

Source of Truth

Let’s say we want to update a search index every time a user record is changed. We need to write their first name, last name, etc to Elasticsearch.

We could do something like this:

class UserIndexWorker
  include TResque::Worker
  inputs :id, :first_name, :last_name, :etc

  def work
    Elasticsearch.index('users').write(id, id: id, first_name: first_name, last_name: last_name, etc: etc)
  end
end

# When user changes
UserIndexWorker.enqueue(user.attributes.slice(:id, :first_name, :last_name, :etc))

This certainly would work, but it is not considered best practice. It is better to be idempotent: pass the minimal information to the background worker, which then looks up the source of truth and writes everything it should every time. That way, if there is any delay between when the job is enqueued and when it runs, it will still send the correct information.

The better approach would look like this:

class UserIndexWorker
  include TResque::Worker
  inputs :user_id

  def work
    Elasticsearch.index('users').write(user.attributes.slice(:id, :first_name, :last_name, :etc))
  end

  def user
    @user ||= User.find(user_id)
  end
end

# When user changes
UserIndexWorker.enqueue(user_id: user.id)

In the same vein, the worker should be in charge of whether or not it needs to do anything in the first place. For example, we can enqueue a worker to run later for an Invoice. If, at that time, the Invoice still should be charged, then charge it.

class InvoiceChargeWorker
  include TResque::Worker
  inputs :invoice_id

  def work
    return unless needed?
    invoice.charge!
  end

  def needed?
    invoice.pending?
  end

  def invoice
    @invoice ||= Invoice.find(invoice_id)
  end
end

# When invoice is created
InvoiceChargeWorker.enqueue_at(24.hours.from_now, invoice_id: invoice.id)

This is another example of single source of truth. Even for jobs that are run immediately, this check is something we always put in place: return immediately if the worker is no longer relevant.

Mutual Exclusion

Let’s say the User object can sometimes change a few times rapidly. The “source of truth” approach will make sure the right thing always gets indexed. So that’s great. But it is pretty silly to index the same data two or more times, right?

In this case, we add a queue lock. The effect is that if something is in the queue and waiting to be processed and you try to enqueue another one with the same inputs, then it will be a no-op. It looks like this:

class UserIndexWorker
  include TResque::Worker
  inputs :user_id

  queue_lock :user_id
end

Another case that often arises is mutual exclusion for runtime. Maybe weird payment things happen to the payment service if two invoices for the same user are happening at the same time.

In this case, we add a worker lock. The effect is that if something is in the queue and about to start running and there is another running at that moment, then it will re-enqueue itself to run later. It looks like this:

class InvoiceChargeWorker
  include TResque::Worker
  inputs :invoice_id

  worker_lock :to_id

  def work
    return unless needed?
    invoice.charge!
  end

  def to_id
    invoice.to_id
  end

  def needed?
    invoice.pending?
  end

  def invoice
    @invoice ||= Invoice.find(invoice_id)
  end
end

For either type of lock, you don’t have to lock on all of the attributes, and you can (as shown in the last example) use calculated values. The namespace of the lock is the worker class name, and you can set the namespace explicitly to allow locking across different workers.

Message Bus

Our message bus and our use of background processes have a lot in common. In fact, the message bus is built on top of the same background processing infrastructure. The question that arises is this: when should something be enqueued directly and when should it publish and respond to a bus subscription?

The first note is that you should always be publishing (ABP). It doesn’t hurt anything to give other systems (optional) visibility into what is happening, or to use it as a logging framework.

Just publishing, however, doesn’t mean we have to use that to do work in the background. We can both publish and enqueue a background worker. We enqueue a worker when the work in the background is essential to the correct operation of the use case at hand.

One example to enqueue directly would be the geocoding worker I mentioned earlier: when the user gives a new postal code, figure out where that is. It’s key to the account management system.

The search example I’ve been using might not actually be the best one because we would have the search system subscribed to changes in the account system. What I didn’t show is that the enqueue call might actually happen from within a subscription:

subscribe "user_changed" do |attributes|
  UserIndexWorker.enqueue(user_id: attributes['id'])
end

So these two concepts can work together. Why not just index it right in the subscription, though? A primary reason might be to use some of the locking mechanisms, since the bus does not have those. It also might be the case that the worker is enqueued from other locations and this keeps things DRY. The worker is also easier to unit test.

TResque

We use Resque as a base foundation and built on top of it with an abstraction layer called TResque. That’s TR (TaskRabbit) Resque. Get it? It puts all of these practices into place as well as adding an abstraction layer for the inevitable, but as yet unprioritized, move to Sidekiq.

I don’t necessarily expect anyone to use this, but it doesn’t hurt to make it available as an example of how we are using these tools.

You define a worker and enqueue things as shown in the examples above. The only layer left is prioritization. You can give a queue name to a worker and then register what priority those queues have. If no queue is given, it is assumed to be the default queue.

require 'tresque'

module Account
  class RegularWorker
    include ::TResque::Worker
    # defaults to account_default queue
  end
end

module Account
  class RefreshWorker
    include ::TResque::Worker
    queue :refresh # lower priority account_refresh queue
  end
end

TResque.register("account") do
  queue :default, 100
  queue :refresh, -5000
end

Then when you run Resque, you can use these registrations to process the queues in the right order.

require 'resque/tasks'
require 'resque_scheduler/tasks'
require "resque_bus/tasks"

namespace :resque do
  task :setup => [:environment] do
    require 'resque_scheduler'
    require 'resque/scheduler'
    require 'tresque'
  end

  task :queues => [:setup] do
    queues = ::TResque::Registry.queues
    ENV["QUEUES"] = queues.join(",")
    puts "TResque: #{ENV["QUEUES"]}"
  end
end
  $ bundle exec rake resque:queues resque:work
  TResque: account_default, account_refresh

This registration layer allows each of the systems (engines) to work independently and still have centralized background processing.

Architecture: Surface Area

The last post in the TaskRabbit architecture series was about service objects. This is an example of what I call minimizing the “surface area” of the code.

Frankly, I might be using the term wrong. It seems possible that “surface area” usually refers to the API signature of some objects. What I’m talking about here is the following train of thought:

  • I change or add a line of code
  • What did I just affect?

The “surface area” is the other things I have to look over. It is the area that I have to make sure has appropriate test coverage. Having a large surface area is what slows down development teams. The goal is to minimize it.

Service Objects

So how does our use of service objects relate to this concept?

Let’s say we have a new requirement, applicable when a Tasker submits an invoice, that modifies what gets saved. If I were to add the code to the InvoiceJobOp from the previous article, then it would only apply when the Op is run. If we were to do something in a before_save in the Invoice model, it might accidentally kick in anytime an Invoice is changed.

That’s a lot more tests and things to keep in mind. If the logic is just in the Op, there is less of that kind of debt, so adding it to the Op is an example of minimizing the surface area of the change.

Namespacing

We went through a roundabout journey to end up where we are. Many of the changes were about surface area and trying to reduce it.

People like microservices and SOA because of this same principle. We tried it and that part of it worked out really well. There was just no way that a change in service A could affect service B. As discussed, however, we ran into issues in other dimensions.

Our current use of engines follows the same approach to achieve the same surface area effect. It is all about namespacing. Modifying the user management engine can not affect the marketplace engine. This allows us to proceed with more confidence when making such changes.

A particular aspect of our setup is that any given model is “owned” by only one engine. The rest of the engines are allowed to read from the database but they cannot write. This provides sanity and minimizes the surface area. For example, the validations only need to live in one spot. You also know that no other code can go rogue and start messing with the data by accident or otherwise.
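One way to back that ownership rule with code is to give the other engines read-only model classes. A minimal sketch, with assumed module and table names (Rails raises ActiveRecord::ReadOnlyRecord on any attempted write):

# In the marketplace engine: a read-only view of the account engine's users table
module Marketplace
  class User < ActiveRecord::Base
    self.table_name = 'users'

    # every instance is read-only, so no marketplace code can write by accident
    def readonly?
      true
    end
  end
end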

Bus

Of course, the world isn’t always cut and dry. Venn diagrams overlap. No abstraction or encapsulation is perfect. The seams in namespacing show up when something that happens in one service (engine) needs to affect something in another one.

For example, we were so happy just a few paragraphs ago that changes to the user management engine do not affect the marketplace engine. That is true and it is great. There is no direct effect from the code. However, as they tend to do, these pesky functional requirements always mess up perfect plans for the code. In this case, when a user changes their first name (in the account engine), the marketplace engine might need to update some data in Elasticsearch.

We use a message bus to observe changes like this and react as appropriate.

# Whenever the user changes
subscribe 'user_may_have_changed', bus_observer_touched: 'user' do |attributes|
  # update the profile in ElasticSearch
  ProfileStoreWorker.enqueue(user_id: attributes['id'])
end

An important note here is that ProfileStoreWorker is idempotent. It writes everything that should go in Elasticsearch every time. This technique reduces surface area by not depending on this single event and its contents, but rather only as a trigger.
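A sketch of what that idempotency looks like in the worker; the index name and attributes are illustrative:

class ProfileStoreWorker
  include TResque::Worker
  inputs :user_id

  queue_lock :user_id  # collapse rapid-fire triggers into one pending job

  def work
    # rebuild the full document from the source of truth every time, so it
    # does not matter which event triggered the run or how many times it runs
    Elasticsearch.index('profiles').write(user.id, user.attributes)
  end

  def user
    @user ||= User.find(user_id)
  end
end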

One might say that these subscriptions are just as coupled as doing everything all in one spot. I see that point because, of course, the same things end up happening. However, we have found this technique to be better for a few reasons.

  • The trigger code (in the account engine) does not need to know about the rest of the system. It can mind its own business.
  • The subscribing code (in the marketplace engine) can be self-contained instead of being mixed up in the trigger code path.
  • Many different code paths might require the ProfileStoreWorker to run. By decoupling it, we actually save complexity in many code paths.

Summary

In code, developers tend to weave a tangled web wherein seemingly innocuous changes have far-reaching effects. We have been able to create more stable and agile code by considering the “surface area” of a change and minimizing it through some encapsulation and decoupling techniques.

Architecture: Service Objects

This is the second post in what is now indisputably a “series” of articles about how we build things at TaskRabbit. Over time, we/I have internalized all kinds of lessons and patterns and are trying to take the time to write some of the key things down.

Building upwards from the last article about models, let’s talk about how we use them. The models represent rows in the database in the Rails ORM. What code decides what to put in those rows and which ones should be created? In our architecture, this role is filled by service objects.

Overall, we default to the following rules when using models in our system:

  • Models contain data/state validations and methods tied directly to them
  • Models are manipulated by service objects that reflect the user experience

Something has to be fat

In the beginning, there was Rails and we saw that it was good. The world was optimized around the CRUD/REST use cases. Controllers had update_attributes and such. When there was more logic or nuance, it was put in the controller (or the view).

There was a backlash of sorts against that and the new paradigm was “Fat model, skinny controller”. The controllers were simple and emphasized workflow instead of business logic. Views were simpler. That stuff was put in the models. Model code was easier to reuse.

Thus arose the great “God Model” issue. Fat is one thing, but we had some seriously obese models. Things like User and Task simply had too much going on. We could put stuff in mixins/concerns but that didn’t change the fact that there was tons of code that all could be subtly interacting with each other.

Business logic has to go somewhere. For us, that somewhere is in service objects.

Operations

In our architecture, we call them “Operations” and they extend a class called Backend::Op. This more or less uses the subroutine gem.

Much can be read about what it means to be a service object, but here is my very scientific (Rails-specific) definition.

  • Includes ActiveModel stuff like Naming, Validations, and Callbacks
  • Allows declaration of what fields (input parameters) it uses
  • Reflects an action in the system like “sign up a user” or “invoice a job”
  • Does whatever it needs to do to accomplish the action when asked including updating or creating one or more models

Here’s a simplified example:

class InvoiceJobOp < ::Backend::Op
  include Mixins::AtomicOperation # all in same transaction

  field :hours
  field :job_id

  validates :job_id, presence: true
  validate  :validate_hour       # hours given
  validate  :validate_assignment # tasker is assigned
  # ... other checks

  def perform
    create_invoice!    # record hours and such
    generate_payment!  # pending payment transaction
    appointment_done!  # note that appointment completed

    if ongoing?
      schedule_next_appointment! # schedule next if more
    else
      complete_assignment!       # otherwise, no more
    end

    enqueue_background_workers!  # follow up later on stuff
  end
end

No Side Effects

When we followed the “Fat Model” pattern, we got what we wanted. This was usually methods in one of the models. Sometimes there were callbacks added. These were the most dangerous because they were happening on every save. Often, this added unnecessary side effects.

With the service object approach, it is very clear what is happening for the action at hand. When you “invoice a job,” you create the invoice, generate the payment, mark the appointment done, schedule the next appointment, and enqueue some background workers.

This certainty leads to less technical and product debt. When something new needs to be added to this action, it’s very clear where it goes.

Errors

Our Op class above does several model manipulations to the related invoices, appointments, etc. Each of these does a save of some kind. Those save calls could raise errors. If any of them raises an error, the Op itself will inherit it, and it will be available via the op.errors method just like on a normal ActiveRecord object.

This also allows chaining of operations. If there was a ScheduleAppointmentOp class, it could be used in the above schedule_next_appointment! method. If it raised an error, it would propagate to the InvoiceJobOp.
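A sketch of that chaining, assuming a ScheduleAppointmentOp that follows the same new/submit! convention shown in the controller example below:

class InvoiceJobOp < ::Backend::Op
  # ... fields, validations, and perform as shown above ...

  def schedule_next_appointment!
    # if the lower-level op adds errors or raises, they propagate
    # up and surface on this op's errors object
    ScheduleAppointmentOp.new(user).submit!(job_id: job_id)
  end
end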

Controllers

Generally speaking, we have one Op per controller action that declares what it expects and manipulates the backend data as needed.

Here is a typical example from one of our controllers.

class JobsController < ApplicationController
  def confirm
    @job = Job.find(params[:id])
    authorize @job, :confirm? # authorization
    op = Organic::JobConfirmOp.new(current_user)
    op.submit!(params.merge(job_id: @job.id)) # perform action
    render :show # render template
  end
end

An action will typically do the following:

  • Load a resource
  • Authorize that the user is allowed to do the action
  • Perform the action with an operation (other things are in place to render and error if the op fails)
  • Render a template

Note that this is clearly not a typical RESTful route. We’ve found that becomes less important when using this pattern. When the controllers are just wiring things up and are all five lines or less, it feels like there is more flexibility.

It probably gets summed up something like this: wherever the fat (real work) is, that should be focused. For us, it’s not the controller because of service objects. The real work is 1 to 1 focused with the use case. If more was in the controllers, we’d probably be closer to the standard index, show, etc methods because of the focus concept.

Sharing

So we have pushed everything out closer to the user experience and away from the models. But what if something is needed in a few pieces of the experience?

A few ways we have done sharing:

  • Two Ops can use a lower-level one or other type of class as noted above.
  • Two Ops can have a mixin with the shared behavior.
  • We can add a method to an applicable model. We tend to do this for simple methods that interpret the model data to answer a commonly-asked question or produce a commonly-used display value, as in the sketch below.
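A minimal example of that last kind of sharing; the method itself is illustrative:

class Invoice < ActiveRecord::Base
  # a commonly-asked question, interpreted from model data, available
  # to any Op, view, or worker that needs it
  def chargeable?
    pending? && hours.to_f > 0
  end
end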

Summary

We have found that this approach provides a more maintainable and overall successful way of building Rails apps.

Architecture: Models

This is the first post in what I hope will be a series of articles about how we build things at TaskRabbit. Over time, we/I have internalized all kinds of lessons and patterns, but have never written them down explicitly and publicly. So let’s give that a try.

I thought we’d start with models. That’s what Rails calls database tables where each row is an instance of that model class.

Overall, we default to the following rules when designing the models in a system:

  • Keep the scope small and based on decisions in the workflow
  • Use state machines to declare and lock in the valid transitions
  • Denormalize as needed to optimize use cases in the experience

Scope

When designing a feature (or the whole app in its early days), you have to decide what the models represent. I’m calling that the “scope” of the model.

For example, most applications have a User model. What columns will it have? Stuff about the user, obviously. But what stuff? One of the tradeoffs to consider is User vs. Account vs. Profile. If you put everything about the user in the same table as the one that’s pointed to in many foreign keys through the system, there will be a performance impact.

So we put the most commonly needed items on every screen load in the User model and “extra” stuff in the Profile.

  • User: authentication, name, avatar, state
  • Profile: address, average rating, bio information

There are plenty of ways to cut this up into other models and move things around, but that’s what I mean about “scope” of a model.

States

State machines are built into the foundation of the system. Almost every model has a state column and an initial state. There are then valid transitions to other states.

For example, there is a PaymentTransaction model. It has an initial “pending” state that represents the time between when an invoice is submitted and when we charge the credit card. During this time, it can move to a “canceled” state if it should not happen. Or, if things go as planned, it can transition to a “settled” state. After that, if there is an issue of some sort, it would go to a “refunded” state. Notably, going from “pending” to “refunded” is not a valid transition.

Creating these states and transitions preserves some sanity in the system. It’s a safety check. By asserting what is possible, we can (try to) prevent things that should not be possible.
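A minimal sketch of the PaymentTransaction transitions described above, using a plain transition map rather than any particular state machine gem:

class PaymentTransaction < ActiveRecord::Base
  VALID_TRANSITIONS = {
    'pending'  => ['canceled', 'settled'],
    'settled'  => ['refunded'],
    'canceled' => [],
    'refunded' => []
  }.freeze

  def transition_to!(new_state)
    unless VALID_TRANSITIONS.fetch(state, []).include?(new_state)
      raise ArgumentError, "invalid transition: #{state} -> #{new_state}"
    end
    update!(state: new_state)
  end
end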

Nouns and Verbs

The TaskRabbit marketplace creates a job that is sent to a Tasker. The Tasker can chat with the Client and can say they will do the job. Or they can decline. If they agree, they are officially assigned to the job and make an appointment. When they complete the job, they invoice the Client for the time worked. In most cases, it’s done at that point. In other cases, it is “ongoing” where they come back next week (to clean again, for example). At more or less any time, the whole thing can be canceled.

If given that description, you could come up with many possible model structures. They would all have a set of pros and cons, but many would work out just fine.

For example, you could have a Job model with these kinds of states: invited, invitation_declined, assigned, appointment_made, invoiced, invoice_paid, canceled, etc. Each would only allow the valid transitions as described above. You would also need the columns to represent the data: client_id, tasker_id, appointment_at, etc.

The main benefit of this approach is centrality. You can SELECT * FROM jobs WHERE client_id = 42 and get all of that user’s situation. Over time, however, we came to value a more decentralized approach.

Now, the models of our system reflect its objects and decisions that the actors make about them. Each fork in the experience has a corresponding model with a simple state machine.

For example, the Invitation model is created first to note the decision the Tasker must make. It then either transitions to accepted or declined. If accepted, it spawns an Assignment. It, in turn, can move to states like completed or ongoing.

There is still the Job model, but it contains the “description” of the work to do, and its id ties together the decision-based models.

Trade-offs

Everything is pros and cons. The decentralized approach has more global complexity (more objects and interactions) but less local complexity (simpler decisions, states).

It seemed to be the single, monolithic state machine that doomed the single Job model. Everything is fine as long as that’s the only path through the system. However, as soon as there is a new way for a Task to be assigned, we have a tangled web of states.

Not every task has the invitation pattern noted above. Some are “broadcast” to many Taskers at once and shown in a browse-able “Available Tasks” section in their app. That’s a new fork in the experience. Ongoing tasks also create a state loop of sorts.

These cause the single state machine to get a bit tangled up, but is more easily handled in the decentralized approach. We can make a Broadcast model instead of an Invitation one. That can have its own set of states. Success in that local state machine can also spawn an Assignment and everything goes on as before.

Denormalization

To try and get the best of both worlds, we have also aggressively embraced a variety of forms of denormalization.

We actively try not to do SQL JOINs for simplicity and performance reasons, but that is at odds with all these little models all over the place. So we have said it’s OK to have duplicate data. For example, each of these “decision” models has the client_id, tasker_id, and pricing information. It just gets passed along. This makes everything a local decision and queries very straightforward.

The big hole in the decentralized approach is to “get all my stuff” easily. For that we have different tactics, both of which are denormalization with use cases in mind.

On write to an object, we can update a central model with the current situation for that Job. For example, when an Assignment gets created, we recalculate and store data in two different tables: one each for the Tasker and the Client, recording what they should see on their respective dashboards. Thus, the API call to “get all my stuff” uses one of those tables. That write is done in the same transaction as the original one.

The other option is basically the same thing, but for less time-sensitive data or more complicated queries. We use a message bus to observe changes. We then denormalize applicable data for a specific use case into a table or Elasticsearch. For example, when an Appointment is created, we would update the Tasker’s availability schedule in the database. Updating this schedule would also trigger an update to our recommendation algorithm, which uses Elasticsearch.

One important note: all of these denormalizations should be idempotent. This allows us to recreate the whole thing from the source of truth or recover if any given event is dropped.

Summary

At TaskRabbit, we default to the following rules when designing the models in a system:

  • Keep the scope small and based on decisions in the workflow
  • Use state machines to declare and lock in the valid transitions
  • Denormalize as needed to optimize use cases in the experience

As always, these are just the default guidelines. In any given case, there may be a reason to deviate, but it would have to be clear why that case was special.

Developing an Amazon Alexa Skill on Rails

In March, we had a hack day at TaskRabbit and I did a demo of posting a task using a borrowed new-ish (at the time) Amazon Echo via Alexa. For the first time in a year, I made a new engine that would handle all these new-fangled conversational UIs and bots and stuff.

The hack day came and went (I didn’t win) and this branch was just sitting there every time I did a git branch command. I only have a few there. Keep it clean, people! Then I saw the Cyber Monday deals on Amazon. I decided that it had sat there long enough so I dusted it off to try and bring it to the finish line.

I more or less started over, of course, because that’s how it goes. I thought I would document the process for anyone else on the trail.

Alexa Sessions

The Alexa API uses JSON to make requests and receive responses. Each session has a guid and (optional) user information.

The API has some cool session management tricks. You can return attributes that will also get passed back on the next request. This effectively gives you “memory” of the previous parts of a conversation. I chose not to do this because I am hoping to use the same engine for other similar interfaces. Instead, I save the same stuff to a database table using the session guid as the key. In either case, it’s important to know where you’ve been and what you need to move forward.
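A sketch of that session storage; the model and column names are assumptions:

class SkillSession < ActiveRecord::Base
  # guid: Alexa's session id; memory: serialized conversation state
  serialize :memory, Hash

  def self.for_request(guid)
    find_or_create_by(guid: guid)
  end

  def remember!(key, value)
    self.memory ||= {}
    memory[key] = value
    save!
  end
end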

In our case, we want to check the box that says there has to be a linked user. Because this is checked, the Alexa App will send them through an OAuth flow on our site. So we generate a token that maps to the user in our system and Alexa stores that token in hers. Side note: it’s hard to not fully personify Alexa after talking (arguing) back and forth all week.

Hello World

Alexa is given a single endpoint for a skill. It will POST the request to that route. So I added the line to the routes.rb file and sent it to a new SkillsController. It looks something like this:

class SkillsController < ::ActionController::Base
  def root
    output = AlexaRubykit::Response.new
    session_end = true
    output.add_speech("Hello World")
    render json: output.build_response(session_end)
  end
end

Post Election

I’ve been thinking a lot about the book The City & the City by China Miéville. It describes a town in which two sets of people share the same physical space but do not acknowledge each other. For that matter, they are forbidden to do so.

I remember being surprised in 1992 when Bill Clinton won the election. I lived in Texas and everyone I knew was voting for Bush. Now, I live in California and the same thing snuck up on me again this week. There’s very little learning there.

But it’s not just where I live because technology has allowed a nation like in the book. Today, my Twitter feed is filled with anxiety, sadness, outrage, and very scared people. I am certain there are people nearby, not to mention in all those (many) red states, that have an exceptionally different feed: one full of hope, expectation, and triumph. And there is no connection between the two.

In that 1992 election, it was the economy (stupid) and the need for change. I’m certainly not a political analyst, but that rings just as true this week. All the jobs numbers are up over the last 8 years, but not everywhere and not for everyone. This has caused a rift.

Forward

Where do we go from here? I believe it’s best for each of us to use our talents and position as leverage to make a difference. In my case, I can help literally provide work in these areas of the country.

TaskRabbit has been focusing on its largest markets because there is still plenty of room to grow there. And the whole thing is a hard problem. The focus helps, but our map is somewhat bare in Middle America. For me, this election is a kick in the pants to get there sooner than later.

The biggest fail there would be to believe that we can “save” people from on high by bestowing the magic of technology. Fortunately, even as the chief technologist at TaskRabbit, I understand that’s not where the value lies. It’s always been about neighbors helping neighbors. We’re just there to make the real-life connection.

I don’t know about you, but I think we could all really use a few more real-life connections at the moment.

React Native Android Launch

Yesterday, we launched our updated Tasker app to our Android community. As noted before on the iOS launch, this is the app that Taskers use to get their work done. This completes our migration to React Native.

All of the credit goes to the team that made this happen, especially JR and Jeremy. It was a lot harder than expected to get everything working on both platforms and they showed great dedication and persistence.

Approach

The goal of the last release was for the iOS users to not even notice. Mission (more or less) accomplished! However, for this one, it didn’t make sense to fork the code by platform without a good reason. So admittedly, the app looks more like an iOS app than an Android one. However, we did go screen by screen looking for places where Android-specific attention would help the user.

Differences

I’ll let JR do a followup to his previous post of all the differences, but the biggest ones in my mind were the handling of the hardware back button and different pickers (date, for example) in the forms. The Platform directory strategy we had put in place during the first cycle worked out pretty well.

We also really struggled with getting push notifications right. This is key for our business and there were many more nuances on the Android platform to work out. We hope to publish what we came up with.

Of course, there was also the more extensive testing to do across our pile of Android devices.

Stats

  • App Javascript: 302 files with 21515 lines of code
  • Test Javascript: 47 files with 5708 lines of code
  • iOS Javascript: 19 files with 449 lines of code
  • Android Javascript: 19 files with 770 lines of code
  • Objective C: 17 files with 885 lines of code
  • Java: 15 files with 912 lines of code
  • iOS Config files: 18 files with 2538 lines of stuff
  • Android Config files: 16 files with 1106 lines of stuff
  • React Components: 124
  • Screens (addressable url patterns): 25
  • Avg. components per screen: 5
  • Dispatcher Events: 55
  • Shared JS (vs. Platform JS) percentage: 94%
  • JS (vs. Native ObjC/Java) percentage: 92%
  • Total shared code percentage: 87%
  • Total shared (including config) percentage: 75%

Next steps

Now, we certainly didn’t do this engineering project because the tech was cool (even though it is). We did it to create a foundation that allows us to deliver value more effectively to our community. So that’s the next order of business. Time to get rolling! I estimate that we can ship features to both platforms at least twice as quickly with half the engineers than we had before.

Too long, but did read anyway: For you execs out there that somehow read this far (even past lines of code counts!), I’d say that we’ve found React Native to be at least 5x more productive than traditional mobile development.

React Native Launch

This week, we launched our updated Tasker app to the community. This is not the app on the app store, but rather the one the Taskers use to get their work done. Functionally, not much has changed since the last release. But underneath, the app has been completely rewritten in React Native.

First and foremost, a huge congratulations needs to go out to the team, especially JR. I’ve never been on an engineering project that went as smoothly or was as fun as this one. So good!

Prototype

I started looking into React Native in the beginning of August. It’s a really great story. First of all, it’s React. I’ve now decided that’s just about the best thing out there. Secondly, I was 10x more productive than when developing a regular iOS app. Finally, there was a decent chance much of the code could be reused on Android. If every feature did not have to be developed twice, we could choose to develop twice as many features or have half the engineers work on something else.

I created a prototype to see if I thought it could work. The main concerns I had were around navigation, push notifications, other native features, data storage, and just getting the right project structure. More or less, I ended up with the React Native Sample App that we published. All of those things showed great promise, or at least that they were possible.


React Native Integration Tests

Coming from a Rails background, we are very familiar with testing our code. While writing our new React Native app, we found ourselves missing a way to test it and ship with confidence. I’ve updated the sample app with the approach we are using for integration tests.

Test Levels

When thinking about what and how to test a React Native app, a few levels come to mind:

  • Unit: Testing some pure Javascript object and its methods. Just run in Javascript.
  • Component: Testing a React component in isolation. You’d want to check its reaction to various state and props. Maybe run just in Javascript with heavy stubbing or in the simulator.
  • Integration: Testing a single screen or workflow in the “real” app. Run in the simulator or on the device.

The approach shown here is the last one: integration testing. We did this one first because if you are only going to do one of the above, it is probably your best bet. By actually testing out what the user does, you get the highest level of “don’t screw it up” coverage.

There are some tradeoffs in this choice. They mostly stem from the fact that it’s the slowest (runtime) approach. Because of that, testing many edge cases takes f-o-r-e-v-e-r to actually run. Something lower-level without the simulator would be much faster.

Running Tests

In the sample app, you follow these steps:

  • Make sure you have the iOS 9.0 simulators installed in Xcode
  • Compile app for the test environment: npm run compile:test
  • Launch simulator and tests: npm test

Running npm test will launch the simulator and the robots take over.
