by David Barnett
Analytics at Khan Academy
Providing a free, world-class education to anyone, anywhere is a lofty goal and one that all of us at Khan Academy pursue with passion. Achieving it means trying to make the best decisions for individual learners across many countries, regions, districts, and schools. In order to do the right thing (and understand what that right thing is) across so many demographics, we must understand our successes and our failures.
The key to understanding our successes and failures is data. We have designed our systems to provide our analysts with the tools they need without compromising our commitment to privacy. In this post, I’ll go through some examples of how privacy-protecting analytics can be done.
Forget personal data while analyzing user actions
There are many reasons for an organization to store personal data. In Khan Academy’s case, you might want us to email you results, inform you of new features, or just let you know about a new assignment from your teacher! However, if you decide to sever your relationship with the site, we would want to protect your privacy by no longer keeping your personal information around.
It can be tough to balance privacy protection with the desire to know what people have been doing on our site in a more general sense. Fortunately, this is not an unsolvable problem. Let’s take a look at an example! (Note: The data and schema below are fictional.)
Users

id | first_name | last_name | email | state | hours_used |
---|---|---|---|---|---|
1 | Nadda | Realname | nadda@example.com | MI | 100 |
2 | Madus | Allup | madus@example.com | CA | 38 |
3 | Imagine | Aryname | imagine@example.com | MI | 15 |
4 | Justin | Mymind | justin@example.com | OH | 68 |
5 | Will | Ibereal | will@example.com | CA | 103 |
Through this data, we might be interested in learning some basic things about who is using our website and how:
- How many users do we have?
- How many users do we have per state?
- What is the total number of hours of usage by state?
- How do states rank by average number of hours used?
Of course, there are lots of other things we could ask and answer, but one thing the questions above have in common is that (despite them being important and interesting) none of them require the analyst to know anything personal about any user. Without diving into SQL or other query methods, we simply don’t need to use any of the columns containing personally identifiable information (PII) to answer any of our proposed questions.
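As a rough illustration of that point, here is a minimal Python sketch that answers all four questions from the fictional table above using only the `id`, `state`, and `hours_used` columns. The variable names are hypothetical, and this is not a description of the queries Khan Academy actually runs.

```python
from collections import defaultdict

# Only the non-PII columns from the fictional Users table: (id, state, hours_used).
rows = [
    (1, "MI", 100),
    (2, "CA", 38),
    (3, "MI", 15),
    (4, "OH", 68),
    (5, "CA", 103),
]

# How many users do we have?
total_users = len(rows)

# How many users, and how many total hours, per state?
users_by_state = defaultdict(int)
hours_by_state = defaultdict(int)
for _user_id, state, hours in rows:
    users_by_state[state] += 1
    hours_by_state[state] += hours

# How do states rank by average number of hours used?
average_hours = {
    state: hours_by_state[state] / users_by_state[state]
    for state in users_by_state
}
ranking = sorted(average_hours, key=average_hours.get, reverse=True)

print(total_users)
print(dict(users_by_state))
print(dict(hours_by_state))
print(ranking)
```

No name or email address is ever read, so an analyst running something like this needs no access to those columns at all.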
The best way to avoid misuse of personal data, whether intentional or unintentional, is not to give anyone (external or internal) unnecessary access to that data in the first place.
Mask PII with encryption
Our approach is to encrypt each user’s personal data with an encryption key unique to that user so that analysts can do their work without compromising personal information. These keys can be stored separately and locked down even further, so that they are accessible only to a few analysts and used only when we need to communicate directly with the user.
Now the tables may look something like this:
Keys

id | encryption_key |
---|---|
1 | igwaordks |
2 | wiorjdfklv |
3 | fmnaasdnf |
4 | lkvjwekjsd |
5 | fhqwhgads |
Users

id | first_name | last_name | email | state | hours_used |
---|---|---|---|---|---|
1 | Ipymv9XvfAWC6OAOZ6SBjwRkcrB1MN24= | yFaR17EO2luqxSP4CZEXjSOiUj1j4UeQ= | kprB9exzIqFtwqTTa0VIqfc7DlwCW1ssQG4/o2fNCLsu2iVp5C3Si | MI | 100 |
2 | 8yaTJ1DSPglQwJPn7aKEz1rjjS2YbeUGo= | 9vVhxLK4aQzhh+BxiiTgbrYbkKHDRo7RU= | 6z1rn2zawbS5JomjVFPVvFi8iCn1hiTDuJmusLC4vc4ME+/3ddX88 | CA | 38 |
3 | 9LJfdVIjSoVITCYttPKoUB5GzQCet0n58= | 6h+IJ6QXVIeSciFUPyvHoVsfjSUMEflk= | 9XT2X2KnPWCNa7NStM73q3jtB2KJA3g1LzK3E4LZWP1V3nOdnVUHZf | MI | 15 |
4 | viSyRXduVhun8fSgWqm1q6BA5h4haDPQ= | FcMEsH1wRPRWfPJ4Y47tyQ4iUVq+4neE= | AHbTTPAUMR3zs2leuxhDqr3ixuSwCFXSx0W52bm5EuJsc69NkzmvZ | OH | 68 |
5 | iQ2KNmKoBFpCG2oJNfGiwSXMOMH1ke8U= | 8rVZDpsKFDxyxrTKo41MXFMe48XIG30FU= | vG/Q90GW24WLQFR0mzZQdGjIlDKRjcNBse6t00ewy6IEDidhpn4yA | CA | 103 |
Only people who have access to the keys table will have any idea how funny the names I made up are. But that’s okay, because it’s none of their business. It is their business to be able to answer questions about how people are using the site, and they can still answer all the questions we listed above.
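To make the per-user key idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `encrypt` and `decrypt` helpers, the `keys` and `users` structures, and especially the cipher, which is a toy built from the standard library so the example runs anywhere. It is not the scheme Khan Academy uses, and a real system should rely on a vetted encryption library instead.

```python
import base64
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stretch key + nonce into `length` bytes with SHA-256 in counter mode.

    Toy construction for illustration only; NOT suitable for production.
    """
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: str) -> str:
    """Encrypt one field with a fresh random nonce; return base64 text."""
    data = plaintext.encode("utf-8")
    nonce = secrets.token_bytes(16)
    cipher = bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))
    return base64.b64encode(nonce + cipher).decode("ascii")


def decrypt(key: bytes, token: str) -> str:
    """Reverse `encrypt`, given the same per-user key."""
    raw = base64.b64decode(token)
    nonce, cipher = raw[:16], raw[16:]
    data = bytes(a ^ b for a, b in zip(cipher, keystream(key, nonce, len(cipher))))
    return data.decode("utf-8")


raw_rows = [
    (1, "Nadda", "Realname", "nadda@example.com", "MI", 100),
    (2, "Madus", "Allup", "madus@example.com", "CA", 38),
    (3, "Imagine", "Aryname", "imagine@example.com", "MI", 15),
    (4, "Justin", "Mymind", "justin@example.com", "OH", 68),
    (5, "Will", "Ibereal", "will@example.com", "CA", 103),
]

keys = {}   # stands in for the Keys table: id -> encryption key
users = []  # stands in for the Users table, with PII columns encrypted

for user_id, first, last, email, state, hours in raw_rows:
    key = secrets.token_bytes(32)  # one random key per user
    keys[user_id] = key
    users.append({
        "id": user_id,
        "first_name": encrypt(key, first),
        "last_name": encrypt(key, last),
        "email": encrypt(key, email),
        "state": state,            # non-PII columns stay readable
        "hours_used": hours,
    })

# An analyst who sees only `users` can still aggregate.
hours_by_state = {}
for row in users:
    hours_by_state[row["state"]] = hours_by_state.get(row["state"], 0) + row["hours_used"]

# Someone with access to `keys` can recover PII when there is a need to.
recovered_email = decrypt(keys[1], users[0]["email"])
```

The analyst-facing aggregation never touches `keys`, which is what lets access to that table be restricted independently.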
Preserve analytics, but forget PII
It’s common for a provider to store user information in multiple places, such as distinct databases, backups, an object store, or a data warehouse. Tracking down and scrubbing every copy is difficult, so it makes sense to simplify the anonymization process with a single control point.
If we always store user data encrypted and keep the decryption key in a single place, then we only have to worry about deleting the user’s decryption key. Once the user’s key is removed from the keys table, there is no way to recover that user’s personal data. And, since we don’t require any personal information for analytics purposes, we don’t lose the ability to answer our general, aggregated questions about their usage.
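A small Python sketch of that single control point follows. The three stores, the `forget_user` and `read_email` helpers, and the toy cipher are all hypothetical and exist only to illustrate the idea; the cipher in particular is a stand-in for a real, vetted encryption library.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (illustration only, NOT for production).

    Applying it twice with the same key restores the original bytes.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


# The single control point: the only place where keys live.
keys = {1: secrets.token_bytes(32), 2: secrets.token_bytes(32)}


def encrypted_row(user_id, email, state, hours):
    """Build a row whose PII column is encrypted with the user's own key."""
    return {
        "id": user_id,
        "email": toy_cipher(keys[user_id], email.encode("utf-8")),
        "state": state,
        "hours_used": hours,
    }


primary_db = [
    encrypted_row(1, "nadda@example.com", "MI", 100),
    encrypted_row(2, "madus@example.com", "CA", 38),
]
# Copies in other systems hold the same ciphertext, never the plaintext.
backup = [dict(row) for row in primary_db]
warehouse = [dict(row) for row in primary_db]


def read_email(store, user_id):
    """Return the user's email, or None once the user's key is gone."""
    key = keys.get(user_id)
    if key is None:
        return None
    row = next(r for r in store if r["id"] == user_id)
    return toy_cipher(key, row["email"]).decode("utf-8")


def forget_user(user_id):
    """Forget a user by deleting one key; no store has to be rewritten."""
    keys.pop(user_id, None)


email_before = read_email(backup, 1)
forget_user(1)
emails_after = [read_email(store, 1) for store in (primary_db, backup, warehouse)]

# Aggregates over non-PII columns are unaffected by the deletion.
total_hours = sum(row["hours_used"] for row in warehouse)
```

The sketch assumes the keys themselves are never copied into backups or the warehouse; if they were, deleting a key in one place would no longer be enough.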
This approach has allowed us to respect our users’ right to privacy while still being able to provide essential information to our data analysts and nonprofit leadership.
We at Khan Academy love working with data! Are you interested in working with our data or any of our other tools/teams? Our team comes from a wide variety of backgrounds, and we actively foster a cross-disciplinary environment because we believe that’s where the magic happens. Khan Academy currently employs around 200 full-time staff, including the creators of our educational content, who come from teaching backgrounds. Learn more and explore open positions.