by Kevin Dangoor
In December 2019, I wrote an introduction to our “Goliath” project: We needed to rewrite Khan Academy’s Python 2-based backend, and we chose to build on a model of Go-based services with a GraphQL API as the replacement. This article is one of a pair of posts looking back on our successfully completed minimum viable experience (MVE) project. This one focuses on the technical choices that helped us succeed in a project that commenters online warned would be perilous. The other, by our CTO and VP of Engineering, Marta Kosarchyn, covers some of the higher-level levers we employed to get the job done.
For a sense of the scope of Goliath MVE, we sought to port all functionality required to have a viable site if we had to turn off our Python 2 monolith, including our content library, logging in (in several different ways), learner and teacher dashboards, the ability to track a learner’s progress, the ability for teachers to create classrooms and assign content, the ability for districts to generate reports, core test preparation functionality, and much more. Many people were surprised when we crossed half-a-million lines of Go, but Khan Academy really does have a lot of functionality. We estimated that the MVE subset was about 80% of all of the porting work we’d need to do.
Solve the hard problems first
At the end of 2018, we started discussing in earnest what we should do about our Python 2 problem. We actually had two issues: needing to move from Python 2 to 3 and needing to move from Google App Engine first generation to second generation (required for Python 3 support). The combination of these two substantial migrations is what led us to consider porting to a new language and architecture in the first place. People who have done a Python 2 to 3 transition have used tools and techniques to make the system work in both versions for a period of time, but those just wouldn’t work for us because of the other changes to the libraries and runtime environments we were working with. That said, we did estimate that moving from Python 2 to 3, together with the other architecture changes, would be less work than porting to another language, but we decided that the benefits of porting were worth it.
For more than four years, Khan Academy has been using Architecture Decision Records (ADRs) to work through and document changes to our systems and tools that could have broad impact. Goliath kicked off in March 2019 with an ADR that documented our move from Python monolith to Go services. After that, a few people did exploration and testing which led to a flurry of other ADRs laying the groundwork for what our system would look like. For example: Which library/tool would we use for GraphQL in our services? (gqlgen) Which linting tool would we use? (golangci-lint) Those kinds of choices were relatively easy to make but still had far-reaching implications. In a moment, I’ll talk about a more difficult problem to tackle.
Though ADRs are docs and not code, I look at them as an incredibly useful tool. They let us communicate consistently and in enough detail about the choices we make, and they prove their worth again when we start thinking about changing those choices later on.
Another similarly useful tool in the early days of the project was the “service boundary” doc. Keep in mind that we were moving from a monolith to services. In an early Goliath ADR, we decided that each service would be uniquely responsible for its own data and that other services would only access that data via the GraphQL API. Coming from a monolith, this would be a big, but important, change. So, the service boundary docs existed to talk through how our data models and code would be split between the services. Some time later, we had a “Meet the Goliath Services” doc that consolidated basic information about all of the services (26) in one place.
In the latter part of 2019, a very simple service was stood up behind our graphql-gateway service and started responding to production traffic. Putting the time into the groundwork and solving the hard problems first was key to ramping up well on the main work in 2020.
Used tools to help us move faster
A tool to convert our App Engine-specific Python 2 code to idiomatic Go would have been great, but that’s an unrealistic fantasy. We were able to make more specialized tools to help us, though. For example, one such tool would convert our Datastore models to Go.
We’re big believers in static analysis as a way to improve consistency and avoid bugs, so our engineers built linters early on to help us avoid common mistakes and evolved those linters as we learned more about what worked well. We built genqlient to help us easily and reliably do service-to-service calls by giving us type safety between services.
Did the rewrite incrementally
When I first wrote about Goliath, some people commented that a “big bang” rewrite would be incredibly risky. We agree! I wrote another blog post about how our GraphQL federation-based architecture meant we were being as incremental as possible in the approach, moving small numbers of fields at a time and testing with real traffic as we progressed.
This approach of shipping units as small as possible limited the risk that we would ship changes that caused breakage for our users. Over the course of Goliath, our code changes stayed within our outage budget. More and more of our users hit our new Go services without even knowing there was a change!
At the beginning of the project, we stressed the importance of working on small slices as much as possible. In those days, this was tricky because the goal might be to switch over one GraphQL field, but that field might depend on a variety of other machinery running inside the monolith. As time progressed, more of the other required parts had been ported, making further porting for the same service smoother.
It’s worth noting that small increments are an ideal, and there are always practical considerations. Toward the end of the project, we were finally able to port some heavily used (think millions of calls a day) and complex functionality that depended on many other pieces being in place. This was a bigger change than we’d normally want to make, but there was no other way to ship it. We tested this change with a canary release (shipping to 1% of users and increasing as we verified that everything worked fine), and all was well. Of course, even this “larger than we’d like” cutover was far, far smaller than the “big bang rewrite” people feared we were doing.
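A percentage-based canary of the kind described above can be sketched in Go as follows. This is an illustrative sketch only, not Khan Academy's actual rollout code: the function name inCanary, the choice of an FNV hash, and the "user-N" ID format are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inCanary reports whether a user falls inside the current rollout percentage.
// (Hypothetical helper.) Hashing the user ID gives each user a stable bucket
// in [0, 100), so a user who is included at 1% stays included at 10%, 50%, ...
func inCanary(userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on a hash.Hash never returns an error
	return h.Sum32()%100 < percent
}

func main() {
	// Simulate widening the rollout over a set of made-up user IDs.
	for _, pct := range []uint32{1, 10, 50, 100} {
		n := 0
		for i := 0; i < 10000; i++ {
			if inCanary(fmt.Sprintf("user-%d", i), pct) {
				n++
			}
		}
		fmt.Printf("rollout %3d%%: %d of 10000 users on the new path\n", pct, n)
	}
}
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users to the new path, which keeps the experience consistent for anyone already in the canary.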
Accepted that we couldn’t escape all logic changes
We wanted to build as close to a direct port as possible to help the project move faster and avoid accidental user-visible changes to the way the site worked. This truly was a key component of the project’s success, but the “as possible” still doesn’t mean “everywhere”.
In moving from a Python monolith to twentyish Go services, we had no choice but to change certain function calls into cross-service GraphQL calls. Some we might be able to eliminate by making use of GraphQL federation (where our GraphQL gateway automatically pulls together data from multiple services simultaneously and delivers that data to the client), but otherwise we’d just go ahead and do the cross-service call. Then, we’d sometimes have to change things a bit to avoid a single call from the user exploding into 1,000 calls to services. Our solutions were not always absolutely ideal but, again, a key component to making this change successfully was not trying to perfect every little piece. This project required us to replicate the behavior with similar performance, and we’ll work on making things better incrementally from there.
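One common way to keep a single user request from fanning out into a thousand cross-service calls is to batch the lookups. The sketch below assumes a hypothetical downstream call that accepts many IDs at once; the names batchFetcher and fetchInBatches are invented for illustration and are not taken from the Goliath codebase.

```go
package main

import "fmt"

// batchFetcher stands in for one cross-service call that takes many IDs
// and returns a value per ID. (Hypothetical signature.)
type batchFetcher func(ids []string) (map[string]string, error)

// fetchInBatches splits ids into chunks of at most batchSize and issues
// one call per chunk instead of one call per ID.
func fetchInBatches(ids []string, batchSize int, fetch batchFetcher) (map[string]string, error) {
	if batchSize <= 0 {
		return nil, fmt.Errorf("batchSize must be positive, got %d", batchSize)
	}
	out := make(map[string]string, len(ids))
	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		res, err := fetch(ids[start:end])
		if err != nil {
			return nil, err
		}
		for k, v := range res {
			out[k] = v
		}
	}
	return out, nil
}

func main() {
	ids := make([]string, 1000)
	for i := range ids {
		ids[i] = fmt.Sprintf("content-%d", i)
	}
	calls := 0
	// Fake service: counts calls and echoes a title for each ID.
	fake := func(batch []string) (map[string]string, error) {
		calls++
		res := make(map[string]string, len(batch))
		for _, id := range batch {
			res[id] = "title for " + id
		}
		return res, nil
	}
	titles, err := fetchInBatches(ids, 100, fake)
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d titles in %d calls\n", len(titles), calls)
}
```

With 1,000 IDs and a batch size of 100, the fake service is called ten times rather than a thousand. The right batch size in practice depends on payload limits and latency, which this sketch does not model.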
To get similar performance, we sometimes had to change our caching patterns from what we had used in the monolith. We made dramatic changes to how we cache our content data in order to make the frequently accessed data performant, and what we ended up with benefitted from being in a separate service compared to our old monolithic architecture. We had a smaller number of servers with larger amounts of RAM available to have the frequently accessed content available right in instance memory.
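The idea of serving frequently accessed content straight from instance memory can be sketched as a small read-through cache. This is a minimal illustration under stated assumptions, not a description of our actual content service: the contentCache type, its loader function, and the string-valued content are all hypothetical, and the sketch deliberately omits eviction and invalidation.

```go
package main

import (
	"fmt"
	"sync"
)

// contentCache keeps loaded content in process memory so repeated reads
// avoid a trip to the backing store. (Hypothetical type; no eviction.)
type contentCache struct {
	mu    sync.RWMutex
	items map[string]string
	load  func(id string) (string, error) // slow path, e.g. a datastore read
}

func newContentCache(load func(string) (string, error)) *contentCache {
	return &contentCache{items: make(map[string]string), load: load}
}

// Get returns the cached value if present; otherwise it loads the value,
// stores it, and returns it. Concurrent misses for the same ID may each
// call load; the sketch accepts that duplicate work for simplicity.
func (c *contentCache) Get(id string) (string, error) {
	c.mu.RLock()
	v, ok := c.items[id]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}
	v, err := c.load(id)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.items[id] = v
	c.mu.Unlock()
	return v, nil
}

func main() {
	loads := 0
	cache := newContentCache(func(id string) (string, error) {
		loads++ // counts how often the slow path runs
		return "body of " + id, nil
	})
	for i := 0; i < 3; i++ {
		if _, err := cache.Get("intro-to-algebra"); err != nil {
			panic(err)
		}
	}
	fmt.Printf("3 reads, %d load from the backing store\n", loads)
}
```

A cache like this trades RAM for latency, which is why it fits a setup with fewer, larger instances: each instance holds more of the hot content, and fewer instances means fewer cold copies to warm up.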
Could we have done something differently?
Undoubtedly, there are many different paths we could have taken to address Python 2’s end-of-life, but this was such a large effort that it’s hard to say any given path really would have worked out better.
One example that comes to mind: We could have started migrating earlier, allowing us to have a more equal split between feature work and Goliath. A problem with this approach, though, is that stretching out the migration time would mean potentially having to build more features in Python while the porting effort was ongoing. Also, our timing for Goliath lined up well with Apollo Federation being released and tools like gqlgen becoming mature. While it’s true that tooling will always be moving forward, we had already started down the GraphQL path and federation made our move to services much easier.
Looking forward
Later this week, my GopherCon talk digging into the Goliath MVE project will go live, so watch for that.
Our port to Go-based services was a long-term bet that will pay off in decreased costs and more flexibility over time. We’re already seeing great results now that the bulk of our system has migrated. If you’re interested in joining us to help with our big, audacious nonprofit mission to provide a free, world-class education to anyone, anywhere, check out our careers page.