De-identification and re-identification

Three months ago, you noticed that ACME Health was cheaper than other insurers and signed up. Reading their privacy policy, you discover they sell your data to third parties. The policy says this data “might include” demographics along with financial and health information. Jack from customer service reassures you: the data is “anonymous” (also called de-identified or depersonalized), and only a small sample of their data is ever shared.

Is it really anonymous? And could a few pieces of information be enough to pick you out?

ID: 54b8617e
Date of birth: 1955-01-01
ZIP Code: 02139
Gender: M
Occupation: Retired
…: …
Premium (p/mo): 10,000–15,000
Overall deductible: 5,000
Diabetic: No
ER (last 12mo): Yes
Pediatrics (last 12mo): No
…: …

Is the data really anonymous?

This is what one “anonymised” record from the fictional ACME Health might look like. Record n°1343 belongs to a man born in 1955 and living in Cambridge, MA. Alongside demographics, it holds medical information such as specialist visits, the current deductible, and so on.

Researchers have long shown that, even without direct identifiers (name, email, social security number), data can be re-identified. In the mid-1990s, Latanya Sweeney re-identified Massachusetts Governor William Weld from his ZIP code, date of birth, and gender alone. More recently, German journalists re-identified the browsing history of a judge and a member of parliament from an “anonymised” dataset they bought by posing as potential customers.
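The attacks described above are usually called linkage attacks: records without names are joined to a second, named source on the attributes the two have in common. Here is a minimal sketch of the idea in Python. All records, names, and the helper functions key and link are hypothetical and made up for illustration, and the sketch assumes the named list covers everyone who could appear in the health data.

```python
from collections import defaultdict

# Hypothetical "anonymised" health records: no names, only quasi-identifiers.
health_records = [
    {"zip": "02139", "dob": "1955-01-01", "gender": "M", "er_visit": True},
    {"zip": "02139", "dob": "1962-07-14", "gender": "F", "er_visit": False},
    {"zip": "02140", "dob": "1955-01-01", "gender": "M", "er_visit": False},
]

# Hypothetical public list (for example a voter roll) that does carry names.
public_list = [
    {"name": "Person A", "zip": "02139", "dob": "1955-01-01", "gender": "M"},
    {"name": "Person B", "zip": "02139", "dob": "1962-07-14", "gender": "F"},
    {"name": "Person C", "zip": "02141", "dob": "1970-03-02", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "gender")


def key(row):
    """Build the join key from the attributes both sources share."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(records, named_rows):
    """Attach a name to each record whose key matches exactly one named row."""
    names_by_key = defaultdict(list)
    for row in named_rows:
        names_by_key[key(row)].append(row["name"])

    linked = []
    for record in records:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # a single candidate is a re-identification
            linked.append((candidates[0], record))
    return linked


matches = link(health_records, public_list)
for name, record in matches:
    print(name, "->", record)
```

In this toy data, two of the three health records link to a single name, even though the health data itself contains no direct identifier.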

Companies sometimes sample their data before release, sharing, say, a few thousand records instead of millions. This is supposed to provide plausible deniability: if your employer finds a record matching your demographics in a 1% sample, it could just as well belong to one of the 99.997% of other people in your country.

Does sampling actually help?

In theory, yes. Each dot is one person. A few share your basic demographics (in red), most do not. If someone gets a small sample of ACME Health’s customers and finds a record matching your demographics, how can they be sure it is yours, and not another red dot?

Sampling is one of the standard tools for releasing “anonymous” data. The Australian Information Commissioner’s De-identification guide states that sampling “creates uncertainty that any particular person is even included in the dataset.” The problem? It doesn’t work.

Let’s pause for a moment: there are many thirty-something men in New York City, so one match on those attributes could belong to anyone. But as you add characteristics, you match fewer and fewer of them. There is probably only one man living in NYC, born on Jan 5 1987, driving a red Mazda, and living with two daughters and one dog.
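The narrowing effect is easy to reproduce on toy data. The sketch below counts how many people still match a target as each attribute becomes known. The five-person population, the attribute names, and the function matches_after_each_attribute are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical toy population; the first entry is the person being looked for.
population = [
    {"gender": "M", "city": "NYC", "birth_year": 1987, "car": "red Mazda", "dogs": 1},
    {"gender": "M", "city": "NYC", "birth_year": 1987, "car": "blue Ford", "dogs": 0},
    {"gender": "M", "city": "NYC", "birth_year": 1990, "car": "red Mazda", "dogs": 1},
    {"gender": "F", "city": "NYC", "birth_year": 1987, "car": "red Mazda", "dogs": 2},
    {"gender": "M", "city": "Boston", "birth_year": 1987, "car": "none", "dogs": 1},
]
target = population[0]
attributes = ["gender", "city", "birth_year", "car", "dogs"]


def matches_after_each_attribute(people, target, attributes):
    """Count the people matching the target as attributes are revealed one by one."""
    counts = []
    known = []
    for attribute in attributes:
        known.append(attribute)
        count = sum(all(p[a] == target[a] for a in known) for p in people)
        counts.append(count)
    return counts


counts = matches_after_each_attribute(population, target, attributes)
for attribute, count in zip(attributes, counts):
    print(f"after adding {attribute}: {count} matching people")
```

The count can only stay the same or shrink with each attribute added. Here it reaches a single person after four attributes.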

In our 2019 article in Nature Communications, we built a statistical model that estimates how often a match is precise enough to almost certainly identify a single person. We validated it on 210 demographic and survey datasets, showing that even very small sampling fractions do little to protect privacy. The model reaches AUC scores between 0.84 and 0.97 for predicting individual uniqueness, with a low false-discovery rate. We found that 99.98% of Americans could be correctly re-identified in any dataset using just 15 basic characteristics, such as age, gender, and marital status.

What does this mean?

Our model computes the probability that a match is correct, based on how many people in the population share your attributes. If you are unique, the match is always correct. If two others share your combination, the chance of correct re-identification is one in three. Contrary to popular belief, sampling does not provide plausible deniability, and it does not effectively protect privacy.
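The arithmetic in this paragraph can be sketched in a few lines, assuming for illustration that the full population is known. This is not the model from the paper, which estimates these quantities from incomplete data. The population, the attribute combinations, and the functions match_probability and matches_in_sample are hypothetical.

```python
import random
from fractions import Fraction

# Hypothetical population of 100 people, stored as (person_id, combination).
# Person 0 is unique; three people share the second combination.
combinations = (
    [("M", 1955, "02139")]
    + [("F", 1962, "02139")] * 3
    + [("M", 1987, "10001")] * 96
)
population = list(enumerate(combinations))


def match_probability(population, combination):
    """Chance that a record with this combination belongs to one given person."""
    k = sum(1 for _, c in population if c == combination)
    if k == 0:
        raise ValueError("nobody in the population has this combination")
    return Fraction(1, k)


def matches_in_sample(population, combination, fraction, seed):
    """Draw a random sample and return the ids of the records that match."""
    rng = random.Random(seed)
    size = max(1, round(fraction * len(population)))
    sample = rng.sample(population, size)
    return [person_id for person_id, c in sample if c == combination]


p_unique = match_probability(population, ("M", 1955, "02139"))
p_shared = match_probability(population, ("F", 1962, "02139"))
print(p_unique, p_shared)

# Repeat a 10% sample many times: any match on the unique combination
# is always person 0, however small the sample is.
all_hits = [
    matches_in_sample(population, ("M", 1955, "02139"), 0.1, seed)
    for seed in range(200)
]
print(all(hits in ([], [0]) for hits in all_hits))
```

Sampling changes how often the unique person appears in the released data, but not how certain a match is once one is found.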

It is time to move past de-identification and tighten the rules for what counts as truly anonymised. Making data usable for statistical and medical research is essential, but it cannot come at the expense of people’s privacy. NIGMS and NIH genetic data, the Washington State health data, the NYC Taxicab dataset, Transport for London bike sharing, and the Australian Medicare and Pharmaceutical Benefits Schedules have all been shown to be easily re-identifiable. As the U.S. President’s Council of Advisors on Science and Technology (PCAST) put it: “Anonymisation remains somewhat useful as an added safeguard, but it is not robust against near-term future re-identification methods. PCAST does not see it as being a useful basis for policy.”

How to cite this work

Luc Rocher, Julien Hendrickx, and Yves-Alexandre de Montjoye. “Estimating the success of re-identifications in incomplete datasets using generative models.” Nature Communications 10 (2019) 3069.

Source code

The underlying model is open source, in Julia and Python, on GitHub.