We count on you! Protecting privacy, enabling analytics

by David Barnett

Analytics at Khan Academy

Providing a free, world-class education to anyone, anywhere is a lofty goal, and one that all of us at Khan Academy pursue with passion. Achieving it means trying to make the best decisions for individual learners across many countries, regions, districts, and schools. To do the right thing (and to understand what the right thing is) across so many demographics, we must understand our successes and our failures.

The key to understanding our successes and failures is data. We have designed our systems to provide our analysts with the tools they need without compromising our commitment to privacy. In this post, I’ll go through some examples of how privacy-protecting analytics can be done.

Forget personal data while analyzing user actions

There are many reasons for an organization to store personal data. In Khan Academy’s case, you might want us to email you results, inform you of new features, or just let you know about a new assignment from your teacher! However, if you decide to sever your relationship with the site, we would want to protect your privacy by no longer keeping your personal information around. 

It can be tough to balance privacy protection with the desire to know what people have been doing on our site in a more general sense. Fortunately, this is not an unsolvable problem. Let’s take a look at an example! (Note: The data and schema below are fictional.)

Users

id  first_name  last_name  email                state  hours_used
1   Nadda       Realname   nadda@example.com    MI     100
2   Madus       Allup      madus@example.com    CA     38
3   Imagine     Aryname    imagine@example.com  MI     15
4   Justin      Mymind     justin@example.com   OH     68
5   Will        Ibereal    will@example.com     CA     103

Through this data, we might be interested in learning some basic things about who is using our website and how:

  • How many users do we have?
  • How many users do we have per state?
  • What is the total number of hours of usage by state?
  • How do states rank by average number of hours used?

Of course, there are lots of other things we could ask and answer. What the questions above have in common is that, important and interesting as they are, none of them requires the analyst to know anything personal about any user. Even without diving into SQL or other query methods, it is clear that we don’t need any of the columns containing personally identifiable information (PII) to answer them: the id, state, and hours_used columns are enough.
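For readers who do want to dive into SQL, here is a minimal sketch of the four questions, using Python's built-in sqlite3 module and the fictional Users table from above. The table name, the variable names, and the choice of SQLite are assumptions made for illustration, not a description of Khan Academy's actual systems. Note that none of the queries reads first_name, last_name, or email.

```python
import sqlite3

# In-memory database holding the fictional Users table from the article.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, email TEXT, state TEXT, hours_used INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Nadda", "Realname", "nadda@example.com", "MI", 100),
        (2, "Madus", "Allup", "madus@example.com", "CA", 38),
        (3, "Imagine", "Aryname", "imagine@example.com", "MI", 15),
        (4, "Justin", "Mymind", "justin@example.com", "OH", 68),
        (5, "Will", "Ibereal", "will@example.com", "CA", 103),
    ],
)

# How many users do we have?
total_users = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# How many users do we have per state?
users_per_state = conn.execute(
    "SELECT state, COUNT(*) FROM users GROUP BY state ORDER BY state"
).fetchall()

# What is the total number of hours of usage by state?
hours_per_state = conn.execute(
    "SELECT state, SUM(hours_used) FROM users GROUP BY state ORDER BY state"
).fetchall()

# How do states rank by average number of hours used?
states_by_avg_hours = conn.execute(
    "SELECT state, AVG(hours_used) AS avg_hours FROM users "
    "GROUP BY state ORDER BY avg_hours DESC"
).fetchall()

print(total_users)          # 5
print(users_per_state)      # [('CA', 2), ('MI', 2), ('OH', 1)]
print(hours_per_state)      # [('CA', 141), ('MI', 115), ('OH', 68)]
print(states_by_avg_hours)  # [('CA', 70.5), ('OH', 68.0), ('MI', 57.5)]
```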

The best way to avoid misuse of personal data, whether intentional or unintentional, is not to give anyone (external or internal) unnecessary access to that data in the first place.

Mask PII with encryption

Our approach is to encrypt each user’s personal data with an encryption key unique to that user, so that analysts can do their work without ever seeing personal information. The keys can be stored separately and locked down even further, so that they are accessible only to a few analysts and used only when we need to communicate directly with the user.

Now the tables may look something like this:

Keys

id  encryption_key
1   igwaordks
2   wiorjdfklv
3   fmnaasdnf
4   lkvjwekjsd
5   fhqwhgads

Users

id  first_name                          last_name                           email                                                   state  hours_used
1   Ipymv9XvfAWC6OAOZ6SBjwRkcrB1MN24=   yFaR17EO2luqxSP4CZEXjSOiUj1j4UeQ=   kprB9exzIqFtwqTTa0VIqfc7DlwCW1ssQG4/o2fNCLsu2iVp5C3Si   MI     100
2   8yaTJ1DSPglQwJPn7aKEz1rjjS2YbeUGo=  9vVhxLK4aQzhh+BxiiTgbrYbkKHDRo7RU=  6z1rn2zawbS5JomjVFPVvFi8iCn1hiTDuJmusLC4vc4ME+/3ddX88   CA     38
3   9LJfdVIjSoVITCYttPKoUB5GzQCet0n58=  6h+IJ6QXVIeSciFUPyvHoVsfjSUMEflk=   9XT2X2KnPWCNa7NStM73q3jtB2KJA3g1LzK3E4LZWP1V3nOdnVUHZf  MI     15
4   viSyRXduVhun8fSgWqm1q6BA5h4haDPQ=   FcMEsH1wRPRWfPJ4Y47tyQ4iUVq+4neE=   AHbTTPAUMR3zs2leuxhDqr3ixuSwCFXSx0W52bm5EuJsc69NkzmvZ   OH     68
5   iQ2KNmKoBFpCG2oJNfGiwSXMOMH1ke8U=   8rVZDpsKFDxyxrTKo41MXFMe48XIG30FU=  vG/Q90GW24WLQFR0mzZQdGjIlDKRjcNBse6t00ewy6IEDidhpn4yA   CA     103

Astute observers will notice that this data looks base64 encoded rather than encrypted. That’s because we’ve base64 encoded these values after encryption in order to make storage (not to mention display in this article) simpler.

Only people who have access to the keys table will have any idea how funny the names I made up are. But that’s okay, because it’s none of their business. It is their business to be able to answer questions about how people are using the site, and they can still answer all the questions we listed above.
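To make the idea concrete, here is a minimal sketch of encrypting the PII columns with a per-user key and then base64 encoding the result. Everything in it is an assumption made for illustration: the function names, the in-memory dictionaries standing in for the Keys and Users tables, and above all the cipher, which is a toy keystream built from HMAC-SHA256 so that the example runs with only Python's standard library. It is not the cipher Khan Academy uses, and a real system should rely on a vetted authenticated-encryption library instead.

```python
import base64
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 over nonce + counter. Illustration only."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        message = nonce + counter.to_bytes(8, "big")
        blocks.append(hmac.new(key, message, hashlib.sha256).digest())
        counter += 1
    return b"".join(blocks)[:length]


def encrypt_field(key: bytes, plaintext: str) -> str:
    """Encrypt one value with the user's key, then base64 encode it for storage."""
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    cipher = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    return base64.b64encode(nonce + cipher).decode("ascii")


def decrypt_field(key: bytes, token: str) -> str:
    """Reverse encrypt_field; only possible for someone holding the user's key."""
    raw = base64.b64decode(token)
    nonce, cipher = raw[:16], raw[16:]
    data = bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher))))
    return data.decode("utf-8")


# Hypothetical stand-ins for the Keys and Users tables.
keys = {1: secrets.token_bytes(32)}  # one random key per user id
user = {
    "id": 1, "first_name": "Nadda", "last_name": "Realname",
    "email": "nadda@example.com", "state": "MI", "hours_used": 100,
}

# Only the PII columns are encrypted; state and hours_used stay readable.
PII_COLUMNS = ("first_name", "last_name", "email")
stored = dict(user)
for column in PII_COLUMNS:
    stored[column] = encrypt_field(keys[user["id"]], user[column])

print(stored["state"], stored["hours_used"])        # analysts still see these
print(decrypt_field(keys[1], stored["email"]))      # key holders can recover PII
```

The base64 step at the end of encrypt_field is what makes the stored values look like the ones in the table above: the raw ciphertext is arbitrary bytes, and encoding it gives us plain text that is easy to store and display.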

Preserve analytics, but forget PII

It’s common for a provider to store user information in multiple places, such as distinct databases, backups, an object store, or a data warehouse. Given the multiple locations, it makes sense to simplify the anonymization process with a single control point. 

If we always store user data encrypted and keep the decryption key in a single place, then we only have to worry about deleting that one key. Once the user’s key is removed from the keys table, there is no way to recover that user’s personal data from any of the places it was copied to. And, since we don’t require any personal information for analytics purposes, we don’t lose the ability to answer our general, aggregated questions about their usage.
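The single control point can be sketched in a few lines. This example again uses Python's built-in sqlite3 module with the fictional keys and a trimmed-down Users table; the function names forget_user, key_for, and hours_by_state are hypothetical, and the encrypted email values are treated as opaque strings. It also assumes that the keys table really is the only place a key lives: if keys were copied into backups, those copies would have to be removed too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE keys (id INTEGER PRIMARY KEY, encryption_key TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, email_encrypted TEXT,
                        state TEXT, hours_used INTEGER);
    """
)
conn.executemany(
    "INSERT INTO keys VALUES (?, ?)",
    [(1, "igwaordks"), (2, "wiorjdfklv"), (3, "fmnaasdnf"),
     (4, "lkvjwekjsd"), (5, "fhqwhgads")],
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "kprB9exzIqFtwqTTa0VIqfc7DlwCW1ssQG4/o2fNCLsu2iVp5C3Si", "MI", 100),
        (2, "6z1rn2zawbS5JomjVFPVvFi8iCn1hiTDuJmusLC4vc4ME+/3ddX88", "CA", 38),
        (3, "9XT2X2KnPWCNa7NStM73q3jtB2KJA3g1LzK3E4LZWP1V3nOdnVUHZf", "MI", 15),
        (4, "AHbTTPAUMR3zs2leuxhDqr3ixuSwCFXSx0W52bm5EuJsc69NkzmvZ", "OH", 68),
        (5, "vG/Q90GW24WLQFR0mzZQdGjIlDKRjcNBse6t00ewy6IEDidhpn4yA", "CA", 103),
    ],
)


def hours_by_state(conn):
    """An aggregate question that never touches PII or the keys table."""
    return conn.execute(
        "SELECT state, SUM(hours_used) FROM users GROUP BY state ORDER BY state"
    ).fetchall()


def forget_user(conn, user_id):
    """Delete only the user's key; the encrypted rows elsewhere stay untouched."""
    conn.execute("DELETE FROM keys WHERE id = ?", (user_id,))


def key_for(conn, user_id):
    """Return the user's key, or None if the user has been forgotten."""
    row = conn.execute(
        "SELECT encryption_key FROM keys WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


before = hours_by_state(conn)
forget_user(conn, 2)
after = hours_by_state(conn)

print(key_for(conn, 2))  # None: user 2's email can no longer be decrypted
print(before == after)   # True: the aggregate answers are unchanged
```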

This approach has allowed us to respect our users’ right to privacy while still being able to provide essential information to our data analysts and nonprofit leadership.

We at Khan Academy love working with data! Are you interested in working with our data or any of our other tools/teams? Our team comes from a wide variety of backgrounds, and we actively foster a cross-disciplinary environment because we believe that’s where the magic happens. Khan Academy currently employs around 200 full-time staff, including the creators of our educational content, who come from teaching backgrounds. Learn more and explore open positions.