Beating the odds: Khan Academy’s successful monolith→services rewrite

by Marta Kosarchyn

When I last posted in early 2020, the COVID pandemic had just caused the sudden shutdown of schools, and Khan Academy became a uniquely vital learning resource for families and for teachers virtually overnight. A lot has happened since then: remote and hybrid learning rollouts in schools, struggles for many students to get access and motivation to learn without the physical community of their school, and recently a joyful return to classrooms for most students.

Our ability to scale and stay reliable during that explosion of usage was a huge win, and it is worth reflecting on what allowed us to absorb such massive, sudden growth so smoothly.

Left unsaid then was that we were also in the beginning phase of a complete re-architecture of the entire Khan Academy site backend. Our stalwart but long-in-the-tooth Python 2 monolith was in need of a port, and we knew that we were also due for a more substantial architectural update: a move to a services-based platform. We aimed to lower hosting costs, improve performance, and further optimize our already-impressive reliability. To meet these needs, we chose the Go language and named the project Goliath.

In spring 2020, Goliath was just underway and in its uncertain early stages, with a daunting backlog (we were looking at close to 1M lines of Python code to port) and the naturally low velocity of an early-stage project. Many of the challenges ahead were still unforeseeable, as they are at that point in any giant project, but we had begun training everyone up on Go and solving many of the juicy architectural problems, such as selecting the optimal number of services, gateways, etc. Now, perhaps most critically, we needed to nail how we were going to port “in flight,” i.e., switch out the backend incrementally without disrupting the site. We were still in the hard early push of the proverbial flywheel. Much heavy and unsteady lifting still awaited us before we could fall into a regular groove and, finally, a predictable, smooth burndown velocity.

I’m very excited to write that we’ve reached our Goliath MVE (minimum viable experience) milestone. We defined MVE as the part of the site that absolutely had to be ported for our users to be able to continue learning our core content. Because Python 2 had been deprecated and Google’s support for it in App Engine (on which we depend) was not guaranteed, reaching MVE was a risk mitigation strategy as well as a major milestone. Landing MVE would give us not only a core new backend in production but also a successful, proven approach for completing the port. We can now continue the rest of the work methodically, with more room for investing in product features, in the next and final phase before we shut down the Python monolith in 2022.

On August 30th, 2021, the backlog for the milestone hit zero points, and our new services backend was taking 95% of all traffic. We immediately (well, we did stop to celebrate—virtually—and get ourselves outfitted in really cool sweaters) turned to the Goliath Endgame, and a new and in many ways even more complex world for our engineering team: balancing completion of the port with a satisfying return to developing valuable new features for our learners, teachers, and school/district leaders as they leverage Khan Academy as an essential supplement to their in-class, curriculum-driven work.

Why did we succeed so definitively where so many other projects fail?  

People

The team. There is no way we would have been successful without this particular set of folks being in these particular roles here at this time. The amount of dedication, focus, and resilience, in addition to top-level technical chops, that it took to overcome pandemic conditions, unforeseen technical obstacles, and stuff-that-happens downstream in our third-party technologies cannot be overstated. Only a special community could have pulled this off, and I am 100% certain that our engineering team was the secret sauce to this massive accomplishment. This was work under stress, and it required an incredible amount of heart.

Process

Key components of our process were modeling everything, measuring to the model, monitoring results, and continually adjusting the process based on sprint results.

Modeling (work estimates, resources, skill sets, time off, dependencies, parallelisms) had to be the starting point in planning. We needed the ability to predict which investments should lead to which results, and what tweaking any of the parameters would imply for reaching the finish line. Our planning had to be impeccable, but most importantly it had to be accurately measurable so we could react, motivate, and think outside the box. By the way, not only did we land, we did so within four months of the initial estimate of completion made 20 months prior.

Once we had a plan, we began continually measuring and calibrating metrics, tracking not only burndown but also burnup (scope growth as work was understood at a more detailed level). We measured completion status for each service. We measured how much traffic was being served by the new backend versus the monolith as our incremental migration progressed. We measured the number of GraphQL fields completed.
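To make the burndown/burnup idea concrete, here is a minimal sketch in Go of how a projection might combine the two. It is an illustration only, not the model the team actually used: the `Sprint` type, the `SprintsRemaining` function, and all of the numbers are assumptions made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Sprint records one sprint's numbers, in story points.
// The fields are hypothetical, not an actual tracking schema.
type Sprint struct {
	Completed int // points finished this sprint (burndown)
	Added     int // points of newly discovered scope (burnup)
}

// SprintsRemaining projects how many more sprints are needed to clear the
// backlog, using the average net progress (completed minus added) per sprint.
func SprintsRemaining(backlog int, history []Sprint) (int, error) {
	if backlog <= 0 {
		return 0, nil
	}
	net := 0
	for _, s := range history {
		net += s.Completed - s.Added
	}
	if len(history) == 0 || net <= 0 {
		return 0, errors.New("no net progress: backlog is not shrinking")
	}
	// Ceiling of backlog / (net / len(history)), done in integer math.
	n := len(history)
	return (backlog*n + net - 1) / net, nil
}

func main() {
	// Made-up history: scope grew in both sprints, but less than was completed.
	history := []Sprint{{Completed: 400, Added: 100}, {Completed: 450, Added: 50}}
	left, err := SprintsRemaining(3500, history)
	if err != nil {
		fmt.Println("projection unavailable:", err)
		return
	}
	fmt.Println("sprints remaining:", left) // prints "sprints remaining: 10"
}
```

Tracking scope growth separately matters here: with burndown alone, the same history would suggest about 425 points per sprint and an overly optimistic finish date.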

Detailed and constant measuring led us to a few critical points where we had to “go slow to go fast.” We did this in order to get a more accurate picture of where we were while in the middle of a giant project like this, “switching out the engines of the plane while in flight” (to which we added “with every passenger seat filled and a party onboard” as the pandemic broke out).  

Principles

Before we ramped up on the work itself, we adopted a set of principles to help drive optimal decision-making by all of the engineering team during the project.

The first principle derived from our goal to not add to the work ahead while we were in the midst of it. Contrary to our normal, highly-distributed decision-making process, I personally was the approver for any new features written in Python. Writing more Python 2 code would just add to how much we had to port. Goliath MVE was long enough that we couldn’t avoid writing some new Python, but there were very few features we absolutely had to write in Python.

Likely the most important of the principles was “avoid scope creep.” At every turn, we sought to do as direct a port as possible from Python to Go, while still ending up with code that read like Go code. As with any system, Khan Academy’s code has areas that we wish were structured differently and minor annoyances that we’d like to fix, but if we tried to tackle all of those at the same time as porting to Go, we would never finish. In reality, we did do some refactoring and cleanup along the way, but this principle helped us keep these to a minimum.

Agility, or adapting classic agile to a large-scale, time/scope-bound project

Originally, we had planned to use agile planning in its usual, purer form, with traditional story points defined within each team. But it quickly became apparent that we needed a canonical story point, so we defined one story point as one engineer-day of work. This was necessary to be able to model some 80 engineers working together on a project of this size.
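One way to see why a canonical point helps: once a point means one engineer-day everywhere, capacity can simply be summed across teams. The sketch below assumes a ten-workday sprint and invents the `Team` type, the team names, and the days-off figures purely for illustration.

```go
package main

import "fmt"

// With one story point defined as one engineer-day, a team's capacity is
// just its available engineer-days. Ten workdays per two-week sprint is an
// assumption (a five-day work week).
const workdaysPerSprint = 10

// Team is a hypothetical planning record for one engineering team.
type Team struct {
	Name      string
	Engineers int
	DaysOff   int // total engineer-days lost to vacation, holidays, and so on
}

// Capacity returns the story points the team can plan for in one sprint.
func (t Team) Capacity() int {
	c := t.Engineers*workdaysPerSprint - t.DaysOff
	if c < 0 {
		return 0
	}
	return c
}

// TotalCapacity adds capacity across teams, which is only meaningful
// because every team uses the same definition of a point.
func TotalCapacity(teams []Team) int {
	total := 0
	for _, t := range teams {
		total += t.Capacity()
	}
	return total
}

func main() {
	// Team names and numbers are made up for the example.
	teams := []Team{
		{Name: "team-a", Engineers: 8, DaysOff: 5},
		{Name: "team-b", Engineers: 6, DaysOff: 0},
	}
	fmt.Println(TotalCapacity(teams)) // prints 135
}
```

With team-defined points, the two capacities above would not be comparable, and a single project-wide burndown would not be possible.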

We also realized along the way that we needed to create checkpoints/milestones that would not only give us a sense of product workflow completion but also serve as opportunities for celebration. (Celebrating was and is important to us. We celebrated big and hard-won milestones, and we celebrated random mile markers. We celebrated anything we could along the way, especially since most of the work was done without the chance to meet up in person.)

We retained two-week sprints from our usual development process and leaned heavily on sprint retros for the checkpointing that fed into our continual measuring and adjusting.

Side by side testing

In a related post, Principal Architect Kevin Dangoor writes in more detail about the technical approaches that were key to success. But I did want to call out one aspect that I’m particularly proud of the team for executing so well, because I think it was critical to success. To solve for incremental migration, we needed an efficient way to make sure that the functionality being replaced was equivalent. The team devised an approach we called “side by side testing”: a gateway routed (most) traffic for a request to the monolith and (some) traffic to new services, and we compared results until all bugs were found and fixed in the new code. At that point, all traffic for those requests went to the new backend.
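The post does not describe the gateway's mechanics, so the following Go sketch is only one possible shape for the comparison step. The `Backend`, `Mismatch`, and `SideBySide` names are hypothetical, and for simplicity the sketch compares every request sequentially, whereas a real gateway would likely sample traffic and call the two backends concurrently.

```go
package main

import (
	"fmt"
	"reflect"
)

// Backend is a hypothetical stand-in for anything that can answer a request.
type Backend func(request string) (map[string]string, error)

// Mismatch records a request for which the two backends disagreed.
type Mismatch struct {
	Request string
	Old     map[string]string
	New     map[string]string
	OldErr  error
	NewErr  error
}

// SideBySide returns a Backend that sends each request to both backends,
// reports any difference through onMismatch, and always returns the
// monolith's answer so users are unaffected while the new code is verified.
func SideBySide(monolith, service Backend, onMismatch func(Mismatch)) Backend {
	return func(request string) (map[string]string, error) {
		oldRes, oldErr := monolith(request)
		newRes, newErr := service(request)
		// Differ if only one side failed, or if the payloads are not equal.
		if (oldErr == nil) != (newErr == nil) || !reflect.DeepEqual(oldRes, newRes) {
			onMismatch(Mismatch{
				Request: request,
				Old:     oldRes,
				New:     newRes,
				OldErr:  oldErr,
				NewErr:  newErr,
			})
		}
		return oldRes, oldErr
	}
}

func main() {
	// Two fake backends that disagree on capitalization.
	monolith := func(req string) (map[string]string, error) {
		return map[string]string{"title": "Algebra basics"}, nil
	}
	service := func(req string) (map[string]string, error) {
		return map[string]string{"title": "Algebra Basics"}, nil
	}

	var mismatches []Mismatch
	handler := SideBySide(monolith, service, func(m Mismatch) {
		mismatches = append(mismatches, m)
	})

	res, _ := handler("course(id: 1)")
	fmt.Println(res["title"], len(mismatches)) // prints "Algebra basics 1"
}
```

Returning the old backend's result while only recording differences is what makes this kind of comparison safe to run against production traffic: a bug in the new code shows up as a logged mismatch rather than as a user-facing error.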

Borderless engineering

Borderless engineering, a somewhat hokey phrase I pulled out of the air one day early in the project, meant that engineers would work on various product code areas and new services, stepping beyond the borders of their usual product area responsibility. This had to be done carefully, making sure that engineers spent sufficient time learning a new service area to be able to contribute thoughtfully. Sometimes it made sense to “loan” engineers to another team, and sometimes it meant shifting ownership for parts of a service (or an entire service) to another team. Done carefully, such moves were a linchpin of our ability to deliver.

One goal

Our team embraced “we all succeed or fail together”: there were no individual accomplishments, only a single engineering team accomplishment, with all eyes on the same prize of hitting the MVE milestone. This exceptional, egoless team modeled selfless collaboration, down to each and every engineering team member.

We’re not done, but we can see the finish line, and we look forward to a giant party before the end of 2022.

Onward through the Endgame!