By Craig Silverstein

In last week’s exciting post, I described an alternative to transactions that we use at Khan Academy, to ensure atomic datastore operations.

When used correctly, both the user-write lock and transactions are effective at avoiding a particular form of database corruption, which I'll call “data stomping.” Data stomping happens when two requests try to modify the same datastore entity at the same time.

Request B does not see A’s modifications, and its PUT overwrites A’s PUT. A’s modifications are entirely lost, even when they don’t conflict with B’s.
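To make the interleaving concrete, here is a minimal sketch that replays it against an in-memory dictionary. The `datastore`, `get`, and `put` names are hypothetical stand-ins, not App Engine APIs; the only point is that a put() writes the whole entity, so B's write discards A's change to a different field.

```python
import copy

# Hypothetical in-memory stand-in for the datastore: key -> entity dict.
datastore = {"user1": {"points": 0, "badges": []}}


def get(key):
    # Return a private copy, like each request's own Python object.
    return copy.deepcopy(datastore[key])


def put(key, entity):
    # A put() writes the whole entity, not just the changed fields.
    datastore[key] = copy.deepcopy(entity)


# Interleaving: both requests GET before either one PUTs.
a = get("user1")             # request A: GET
b = get("user1")             # request B: GET (does not see A's change)
a["points"] += 5             # request A: MODIFY points
put("user1", a)              # request A: PUT
b["badges"].append("gold")   # request B: MODIFY a different field
put("user1", b)              # request B: PUT overwrites A's PUT

result = datastore["user1"]
print(result)                # A's 5 points are gone
```

Even though the two requests touched different fields, only B's version of the entity survives.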

Transactions solve this problem by noticing the contention at request B’s put() time, and forcing request B to retry from the beginning. Locks solve the problem by not allowing the time-overlap at all.

Note that for both techniques, you need to follow the GET - MODIFY - PUT idiom. It is an error — a db stomping waiting to happen — to do the GET outside the transaction/lock!

In this blog post, I describe the infrastructure we put in place at Khan Academy (which uses Google App Engine) to notice that error, and to make it easy to modify the source code to prevent it. We are making the source code available in two files:

db_hooks.py: a generic db/ndb hooking infrastructure
txn_safety.py: the specific hooks we use to detect and alert for transaction-safety violations

How do people use transactions (and locks) wrong?

The mistake people make is simple: they do the GET outside the transaction (or lock). Then when the transaction retries, it doesn’t re-GET, so you end up with request B stomping out request A’s changes.
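The following toy model sketches why a retry does not help when the GET sits outside the transaction. Everything here (`ToyStore`, `run_in_transaction`, the `interloper` hook that simulates request A committing mid-transaction) is a hypothetical simplification, not how the App Engine datastore is implemented.

```python
import copy


class ToyStore:
    """Hypothetical stand-in for a datastore with optimistic transactions."""

    def __init__(self, data):
        self.data = copy.deepcopy(data)
        self.version = 0  # bumped on every committed write

    def get(self, key):
        return copy.deepcopy(self.data[key])

    def commit(self, writes):
        self.data.update(copy.deepcopy(writes))
        self.version += 1


def run_in_transaction(store, fn, interloper=None, retries=3):
    """Run fn(writes); retry from the top if someone else committed meanwhile."""
    for attempt in range(retries):
        start_version = store.version
        writes = {}
        fn(writes)
        if attempt == 0 and interloper is not None:
            interloper()  # simulate request A committing mid-transaction
        if store.version == start_version:
            store.commit(writes)
            return
    raise RuntimeError("too much contention")


def request_a(store):
    user = store.get("u1")
    user["points"] += 5
    store.commit({"u1": user})


def request_b_bad(store):
    user = store.get("u1")          # GET outside the transaction

    def txn(writes):
        user["nickname"] = "bee"    # the retry reuses the stale object
        writes["u1"] = user

    run_in_transaction(store, txn, interloper=lambda: request_a(store))


def request_b_good(store):
    def txn(writes):
        user = store.get("u1")      # GET inside: every retry re-reads
        user["nickname"] = "bee"
        writes["u1"] = user

    run_in_transaction(store, txn, interloper=lambda: request_a(store))


initial = {"u1": {"points": 0, "nickname": ""}}
bad_store, good_store = ToyStore(initial), ToyStore(initial)
request_b_bad(bad_store)
request_b_good(good_store)
print(bad_store.data["u1"])   # A's points are lost
print(good_store.data["u1"])  # both changes survive
```

In both cases the first attempt is rejected and retried. The difference is only in what the retry writes: the stale object captured before the transaction, or a freshly read one.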

You may think it’s easy to remember to always do your GET’s inside a transaction, but there are many ways to get this wrong:

You do the PUT in a function that’s far removed from the GET.
You are given an entity and forget to run entity = entity.key.get() to “re-GET” inside the transaction.
There are multiple codepaths used to GET an object, and only some of them (maybe the ones used 99% of the time, so everything seems mostly fine) are done inside the transaction.
The get() call gives a cached result.

This last cause was a big problem for us: we would cache the entity corresponding to the current user, for efficiency. Then, whenever we wanted to update the current user, we’d do get_current_user().modify().put() inside a transaction, without realizing that get_current_user() was returning some cached entity that was fetched way before the transaction started.

The solution is pretty straightforward, once you realize there’s a problem. The issue is finding out there’s a problem in the first place, and then tracing through the code to find the problematic GET.

A Taxonomy of Data Stomping Errors

While the GET-outside-transaction error is the most common, there are many related types of data corruption. The infrastructure we put in place catches the following three types:

Stomping

Doing the PUT inside a transaction or user-lock, but not the GET.

@ndb.transactional
def seems_ok_but_is_not(uid):
    user_data = UserData.get_from_id(uid)   # cached!
    user_data.points += 5
    user_data.put()

Totally unprotected stomping

Doing a GET - MODIFY - PUT entirely outside a transaction or user-lock.

def badfunc(user_data):
    user_data_again = db.get(user_data.key())
    user_data_again.points += 5
    user_data_again.put()

Internal stomping

Doing two nested (or interleaved) GET - MODIFY - PUT’s inside a single transaction/lock.

@ndb.transactional
def _internal_fn(uid):
    user_data1 = get_user(uid)
    user_data1.points += 5
    user_data1.put()

@ndb.transactional
def public_fn(uid):
    user_data2 = get_user(uid)
    user_data2.points += 10
    _internal_fn(uid)
    user_data2.put()
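To see what goes wrong, here is the same call pattern replayed against a toy in-memory store. The `datastore`, `get_user`, and `put_user` helpers are hypothetical; the sketch assumes, as in the example above, that each get returns a fresh Python object.

```python
import copy

# Hypothetical in-memory datastore; each get_user() returns a new object.
datastore = {"u1": {"points": 0}}


def get_user(uid):
    return copy.deepcopy(datastore[uid])


def put_user(uid, user_data):
    datastore[uid] = copy.deepcopy(user_data)


def internal_fn(uid):
    user_data1 = get_user(uid)
    user_data1["points"] += 5
    put_user(uid, user_data1)


def public_fn(uid):
    user_data2 = get_user(uid)    # reads points == 0
    user_data2["points"] += 10
    internal_fn(uid)              # writes points == 5
    put_user(uid, user_data2)     # writes points == 10, dropping the 5


public_fn("u1")
final_points = datastore["u1"]["points"]
print(final_points)               # 10, not the intended 15
```

No second request is involved at all: the function stomps on its own nested write, which is why a transaction alone cannot prevent it.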

How To Use It

To get the benefits of transaction-safety checking, you must annotate a db/ndb model with a decorator saying what method you use to guarantee safe put()’s:

@never_written_model() — super rare!
@abstract_model() — commonly for polymodels and utility classes
@structured_property_model() — for (Local)StructuredProperty models
@written_once_model() — easiest to use correctly (no need for transactions)
@written_in_transaction_model() — you put get-modify-put in a transaction
@written_with_user_lock_model(lockid_fn) — you put get-modify-put in a user write lock
@written_via_cron_model() — appengine lets you schedule cron jobs; if an entity is only accessed via a cron job, we know two requests will never access that entity at the same time
@dangerously_written_outside_transaction_model() — for legacy code
@dangerously_written_outside_transaction_model_or_user_lock() — ditto

These instruct the transaction-safety system what kinds of violations to look for. There is much more documentation of each choice at the bottom of txn_safety.py. Note that @written_with_user_lock_model takes an argument: it should be a function that takes an entity and returns the lock_id for that entity. For instance, if the lock is protecting a single user, the lock_id might be the user-id. This is necessary because a single lock can protect many different entities. Example:

@db_decorators.written_with_user_lock_model(lambda e: e.kaid)
class UserVideo(db.Model):
    """A single user's interaction with a single video."""
    user = db.UserProperty(indexed=True)
    kaid = db.StringProperty(indexed=True)   # user's user-id
    video_key = object_property.KeyProperty(indexed=True)
    ...

Second, you have to wrap your WSGI application in the transaction-safety middleware:

app = webapp2.WSGIApplication([...routes...])
app = txn_safety.TransactionSafetyMiddleware(app)

Then you just run your application. If there is a transaction-safety violation, the system will log it:

Did a put() of the same entity from two different python objects: <class 'user_models.UserData'>.
Other put:
--- 
File "/api/internal/scratchpads.py", line 408, in update_user_scratchpad old_points, old_challenge_status, client_dt, time_taken) 
File "/api/internal/scratchpads.py", line 436, in add_actions_for_user_scratchpad finished=(progress == "complete")) 
File "/scratchpads/models.py", line 2775, in record_for_user_and_scratchpad scratchpad=scratchpad) 
File "/rewards/triggers.py", line 119, in update_with_triggers_no_put user_data, possible_badges, dry_run=dry_run, **kwargs) 
File "/rewards/util_rewards.py", line 158, in maybe_award_badges_no_put badge.award_to(user_data=user_data, **kwargs) 
File "/badges/cs_badges.py", line 450, in award_to user_data, self.name, self.description) 
File "/notifications/cs_notifications.py", line 201, in send_certificate_notifications coach.put() 
File "/user_models.py", line 4173, in put result = super(UserData, self).put(*args, **kwargs)
---
Traceback (most recent call last): 
File "/api/internal/scratchpads.py", line 408, in update_user_scratchpad old_points, old_challenge_status, client_dt, time_taken) 
File "/api/internal/scratchpads.py", line 436, in add_actions_for_user_scratchpad finished=(progress == "complete")) 
File "/scratchpads/models.py", line 2777, in record_for_user_and_scratchpad user_data.put() 
File "/user_models.py", line 4173, in put result = super(UserData, self).put(*args, **kwargs) 
File "/db_hooks.py", line 55, in wrapper hook(model_or_models) 
File "/db_patching.py", line 613, in _examine_put_state _examine_tainted_put(entity) 
File "/db_patching.py", line 605, in _examine_tainted_put % (type(entity), tb))

This is an example of “internal stomping.” If you had access to the source code, these tracebacks would be enough to tell you that record_for_user_and_scratchpad does a get() + put() of some user-data, and send_certificate_notifications does a nested get() + put() of the same user-data.

For power users, the source code documents functions like disable_user_write_lock_checking_in_test().

In the last blog post I mentioned that lock_util.py’s fetch_under_user_write_lock could not be used at that time. Well, with the functionality in this blog post, it can be! That makes it really easy to re-fetch an entity (or not, as needed) under the user write lock.

def update_points(user_data):
    with fetch_under_user_write_lock(user_data) as ud_again:
        ud_again.points += 5

If we are already under the write lock, this is a no-op; otherwise it re-fetches the entity under the lock. It works for both db and ndb entities.

How It Works

The basic approach of the transaction-safety infrastructure is to annotate every datastore entity with a history of when it was retrieved from the datastore and what the state of the world was at the time: in transaction X, or under user lock Y. At put() time, it examines that history to make sure the put() is happening in the same transaction or user lock (or indeed in any transaction at all) and complains if not, giving a traceback of the put() call to help with debugging. It also keeps track of whether the same entity was get()-ed multiple times, which is needed to detect internal stomping.

Here is a snippet from txn_safety.py to demonstrate how it works:

# For a newly created entity, we don't need a transaction.
if not hasattr(entity, '_ts_get_nonce'):
    return     # not created via a get()
get_transaction = getattr(
    entity, '_transaction_at_request_time', None)
put_transaction = _transaction_object()
if not get_transaction and not put_transaction:
    _ts_violation('Did not use a transaction')
elif not get_transaction:
    _ts_violation('Did the get() outside a transaction')
elif not put_transaction:
    _ts_violation('Did the put() outside a transaction')
elif get_transaction != put_transaction:
    _ts_violation('Did the get() and put() in different txns')

The bulk of the complexity is actually in db_hooks.py: the code for adding get-hooks and put-hooks in App Engine db and ndb models. While there is a built-in hook system for ndb, it is not adequate for our purposes because it only hooks get() calls, not queries. And the older db library has no hooks at all. db_hooks.py provides a uniform interface for hooking all functions that get or return entities in both db libraries.
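The general idea of hooking can be sketched by wrapping methods on a class so that a callback sees every entity going in or coming out, whether the call returns one entity or a list. This is only an illustration of the technique: `hook_results`, `hook_arguments`, and `ToyDb` are hypothetical names, and the real db_hooks.py has to deal with the actual db and ndb APIs.

```python
import functools


def _as_list(entity_or_entities):
    """Normalize None, a single entity, or a list of entities to a list."""
    if entity_or_entities is None:
        return []
    if isinstance(entity_or_entities, (list, tuple)):
        return list(entity_or_entities)
    return [entity_or_entities]


def hook_results(cls, method_name, hook):
    """Call hook(entity) on every entity that cls.method_name returns."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        for entity in _as_list(result):
            hook(entity)
        return result

    setattr(cls, method_name, wrapper)


def hook_arguments(cls, method_name, hook):
    """Call hook(entity) on each entity passed in, before the method runs."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, entity_or_entities, *args, **kwargs):
        for entity in _as_list(entity_or_entities):
            hook(entity)
        return original(self, entity_or_entities, *args, **kwargs)

    setattr(cls, method_name, wrapper)


class ToyDb:
    """Hypothetical storage class with get, query, and put entry points."""

    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def query(self):
        return list(self.rows.values())

    def put(self, entity):
        self.rows[entity["key"]] = entity


seen = []
hook_results(ToyDb, "get", lambda e: seen.append(("get", e["key"])))
hook_results(ToyDb, "query", lambda e: seen.append(("query", e["key"])))
hook_arguments(ToyDb, "put", lambda e: seen.append(("put", e["key"])))

db = ToyDb()
db.put({"key": "a"})
db.put({"key": "b"})
db.get("a")
db.get("missing")   # returns None, so no hook fires
db.query()
print(seen)
```

Hooking query() as well as get() is the part that matters here: an entity that arrives via a query needs the same bookkeeping as one fetched by key.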

Appendix: Non-Data Stomping Errors

Data stomping is not the only problem you can run into with db data. Here are four cases our infrastructure does not detect.

Stale reads

GET + GET - MODIFY - PUT + <use first GET>

@ndb.transactional
def goodfunc(user_data):
    user_data_again = user_data.key.get()
    user_data_again.points += 5
    user_data_again.put()

def oopsfunc(user_data):
    if should_assign_points:
        goodfunc(user_data)
    if user_data.points > 100:   # stale read!
        ...

Consistency

Two PUT’s that should be in a transaction together.

This (not db data stomping) is the traditional motivation for using transactions. If you are modifying both a coach and student to teach each about the other, that should happen inside a transaction. We do nothing to check that you do.

Overwrites

Two new-entity PUT’s with the same key at the same time.

If request A does MyModel(key='foo', value=1).put() and request B does MyModel(key='foo', value=2).put(), only one will win and the other will be thrown away.

App Engine provides get_or_insert(), which you can use in lieu of put() in situations where that is a concern. Note that this is only an issue if you explicitly specify a key param. Otherwise, unique keys are assigned automatically, and it’s impossible for two new-entity put()s to conflict.

Races

You want A’s GET - MODIFY - PUT to happen before B’s, but B goes before A.

API X is a call that gives a user some points. API Y is a call that sees if a user has enough points for a particular badge, and awards it if so. You want to make sure, in your request, that API X is called before API Y. But while our code guarantees those two APIs won’t update the user-data at the same time, nothing guarantees which request will run first. You have to enforce that ordering constraint in your own code.
