To Rewrite or Not to Rewrite?

To rewrite, or not to rewrite- that is the question: 
Whether 'tis better for the product to suffer
The features and debt of outrageous history
Or to once again battle a sea of edge cases,
And by forgetting relive them. To wish- to hope-
No more; and by hope to say we end
The heartache, and the thousand unnatural cases
That code can error to. 'Tis a codebase
Devoutly to be wish'd. To wish- to hope.
To hope- perchance to rebuild: ay, there's the rub!
For in that hope of clarity what simplicity comes
When we have removed this outdated cruft
Must give us success. But give the respect
To the current repo of such long life.
For who would bear the features of the past,
High expectations, the race conditions,
The admin tools, the product delay,
The exhaustion overcome, and the data
That shall posthaste be moved to a new store
As each mistake of the past be brought back
With sighs of regret? Who would these issues bear,
To toil and code under a weary life,
But that the chance of something rebuilt
That undiscover'd codebase, from whose lines
No complexity returns- tempts the will,
And makes us choose between those ills we have
Than sprint towards others we know not of?
Thus the unknowns make cowards of us all,
And thus the heavy weight of such choice
Is oft tempered by promises of thought,
And refactorings of great scope and breadth
With this regard the hope does turn awry
And lose the name of action.- What say you?
The lauded pivot! Siren of opportunity
May all our sins be forgotten.

The internal struggle of the rewrite decision eats away at developers. It could be so much better. We have learned so much. Let’s start over. It causes inaction over months accompanied by much grumbling. But if you do it, how can you make sure it doesn’t turn into a tragedy?

I can’t say that I am happy or proud that we have rewritten TaskRabbit twice. That doesn’t feel right. Conceptually, if we had done it correctly the first time, then it wouldn’t have been needed. Or maybe we should have done it in place. I would say that’s absolutely fair, but it doesn’t capture the reality of development over the last 6 years.

When I started writing this post, I just felt like mapping my existential crisis to Hamlet’s, and now I’m heading towards defending myself against Joel’s famous post arguing that you should never do a rewrite. I just went back and read it (again) and I (still) agree. It’s hard to argue with. Maybe it’s best to compare the times we did rewrite with the times we didn’t.

Refactor

The V2 timeline notes the rewrites and some of the major refactoring efforts that we’ve gone through. There were obviously many times that we did not rewrite the whole system. Ha.

A few of those projects:

  • Switching out Delayed Job for Resque
  • Refactoring the ratings system
  • Extracting local services out into external ones
  • Allowing multiple Taskers on a Task (1 to N change)
  • Making it possible to have hourly rates
  • Doing more things asynchronously using Resque Bus
  • Allowing users to “half sign up” for the site

Most of these things were somewhere on the spectrum between features and major refactors, but all of them had some key components that might have been a trigger to consider a rewrite.

Usually, it’s when some underlying assumption is just no longer the case. For example, a task no longer has a single Tasker, but rather can have many. Or the current_user might only be partially “logged-in” to the site. Much of the code has to be touched to undo that assumption.
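To make the 1 to N change concrete, here is a minimal sketch of one way to soften it: keep the old singular accessor as a thin shim over the new collection while call sites are migrated. The `Task` class, its `taskers` list, and the `tasker` shim are hypothetical names for illustration, not the actual TaskRabbit model.

```ruby
# Hypothetical model: a task that used to have one Tasker now has many.
class Task
  attr_reader :taskers

  def initialize(taskers = [])
    @taskers = taskers
  end

  # Legacy accessor kept only so old call sites keep working during the
  # migration. It bakes in the old "single Tasker" assumption by returning
  # the first one, so every caller still needs to be reviewed eventually.
  def tasker
    @taskers.first
  end
end

task = Task.new(["alice", "bob"])
task.tasker   # old code path: sees only one Tasker
task.taskers  # new code path: sees all of them
```

A shim like this buys time, but it does not remove the work: any code that calls `tasker` is still silently ignoring the other Taskers.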

Or maybe it’s a data migration/timing issue. When switching background job processors, there is plenty of coordination to do. When changing the table(s) that data is stored in, there is a double-write period, just as there would be when moving to a completely new system. This is because they are new systems, just in the shell of the current one.
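A double-write can be sketched in a few lines. This is an illustration under assumptions, not our production code: `MemoryStore` and `DoubleWriter` are hypothetical names, and the in-memory hashes stand in for the old and new tables. The old store stays the source of truth while the new one is verified.

```ruby
# Hypothetical in-memory store standing in for a database table.
class MemoryStore
  def initialize
    @rows = {}
  end

  def write(id, attrs)
    @rows[id] = attrs.dup
  end

  def read(id)
    @rows[id]
  end
end

# Writes go to both stores; reads are served from the old store and
# compared against the new one so differences can be investigated.
class DoubleWriter
  attr_reader :mismatches

  def initialize(old_store, new_store)
    @old_store = old_store
    @new_store = new_store
    @mismatches = []
  end

  def write(id, attrs)
    @old_store.write(id, attrs)
    begin
      @new_store.write(id, attrs)
    rescue StandardError => e
      # A failure in the new path is recorded, not raised, because the
      # old store is still the one users depend on.
      @mismatches << [id, e.message]
    end
  end

  def read(id)
    old_value = @old_store.read(id)
    new_value = @new_store.read(id)
    @mismatches << [id, "read mismatch"] unless old_value == new_value
    old_value
  end
end

old_store = MemoryStore.new
new_store = MemoryStore.new
writer = DoubleWriter.new(old_store, new_store)
writer.write(1, { rating: 5 })
```

Rows written before the double-write started exist only in the old store, so they would show up as mismatches until a backfill copies them over.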

Service-Oriented

That architectures move towards being service-oriented seems to be common knowledge. We found that there are various pros and cons with the approach. However, I would say that what we did was a type of rewrite.

It’s a more gradual and sustainable version, though, because it’s a continuum. Very gradually, we moved functionality to new apps that leveraged the original app’s APIs. The stuff inside that shell didn’t really change. It just got a new face and became the data provider.

It seems likely that something like this is the recommended path of handling a rewrite. First, you draw a line around the system that needs the overhaul. Then you encapsulate that system and expose an API. You write lots of tests on the API and have other things depend on it. Then you swap in the system. Ideally, you are double-writing just like in the minor refactor so you can do it gradually and in parallel to see issues.
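The "write lots of tests on the API, then swap in the system" step might look something like the sketch below. Everything here is hypothetical: `LegacyRatings`, `NewRatings`, and `ratings_contract_ok?` are made-up names, and the ratings example is only borrowed from the refactor list above. The point is that one contract check runs against both implementations, so the swap is safe as long as both pass.

```ruby
# Hypothetical legacy implementation: stores every score.
class LegacyRatings
  def initialize
    @scores = Hash.new { |hash, key| hash[key] = [] }
  end

  def add(tasker_id, score)
    @scores[tasker_id] << score
  end

  def average(tasker_id)
    scores = @scores[tasker_id]
    scores.empty? ? nil : scores.sum.to_f / scores.size
  end
end

# Hypothetical replacement: keeps running totals instead of every score.
class NewRatings
  def initialize
    @totals = Hash.new(0)
    @counts = Hash.new(0)
  end

  def add(tasker_id, score)
    @totals[tasker_id] += score
    @counts[tasker_id] += 1
  end

  def average(tasker_id)
    return nil if @counts[tasker_id].zero?

    @totals[tasker_id].to_f / @counts[tasker_id]
  end
end

# The contract is written against the API only, so it does not care
# which implementation sits behind it.
def ratings_contract_ok?(ratings)
  return false unless ratings.average(42).nil?

  ratings.add(42, 4)
  ratings.add(42, 5)
  ratings.average(42) == 4.5
end

legacy_ok = ratings_contract_ok?(LegacyRatings.new)
new_ok = ratings_contract_ok?(NewRatings.new)
```

In a real suite the contract would be many tests rather than one method, but the shape is the same: callers depend on the API, and the implementation behind it becomes replaceable.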

Rewrite

So what is the right time to make a completely new shell (app/repo)? I’ll agree that the correct answer could be “never.” However, the siren song of the full rewrite is strong.

The main thing to understand is that the goal was to test a new business model. We had experimented with many different ways to get tasks done and thought that we now knew the single, best way. The “single” is the important part there. As noted, the current codebase had support for many iterations and combinations that were created in search of product-market fit. While it would have been technically possible to shoehorn the new model in as yet another variation, we were already overrun with combinations.

The second note is that this was to be a test in a new market. Specifically, we were going to launch this test in London. While Londoners do speak English, we really wanted to do full translation the right way on the whole site. It would have taken a really long time to do i18n right in the current app. It was just not built with that in mind. And the majority of that work wouldn’t have been needed. To do it correctly would have also meant spreading the notion of “locale” through the entire ecosystem, including the payment system, database, background workers, etc. Overall, it was much easier to start with the requirement of i18n than to bolt it on.

The main locale changes could have taken place in the core app and most of the translation could have occurred in another SOA app that used the APIs just like our US app. The truth is that we had definitely grown weary of that whole pattern. The coupling would have been even stronger between the two systems. The core app took forever to boot up. The test suite took days on days. It was a new direction for the company that we thought was the future. We could leave the baggage behind and simplify.

We could launch this simplified experience and codebase in a new country and see if it worked. Specifically, it’s not the case that we were changing the airplane in flight. Because of the market segmentation, it was closer to a new startup. It would start with one person in London posting, just like we did years ago in Boston, instead of carrying the whole load of our US app. This substantially reduced both the risk of technical glitches and the cost of being wrong about the business model. I found it hard to argue with rails new in that reduced-risk environment.

Merge

When the new product did very well in London, the next step was to bring it to the US. It now went from being a new startup to having a merger with the old one. That’s the part of the scenario where things get tricky, of course. I’ll talk about the technical details of the migration some other time, but it actually went really smoothly. Because all the code was already running the London marketplace, there were no real technical issues either.

If there was a reason to do it all in the same ecosystem, it would have been the more human factor. It would have been easier/necessary to evolve towards the new product. This would have been a more gradual change for the people used to the way the site worked. It likely would have been a smoother transition, but also very painful behind the scenes. We would not have the clarity, simplicity, and improved power to innovate that we got from the rewrite.

A year and a half later, it’s pretty clear we made the right choice. The business is great and the tech stack is still pretty fresh and clean. At least it worked out better than it did for Hamlet and that’s all we can really hope for.

Copyright © 2017 Brian Leonard