Sensitive data, linked data, and the "reidentification" phenomenon

One of the truisms in data curation is "well, of course we don't let sensitive data out into the wild woolly world." We hold sensitive data internally. If we must let it out, we anonymize it; sometimes we anonymize it just on general principles. We're not as dumb as the Google engineers, after all.

Only it turns out that data anonymization can be frighteningly easy to reverse-engineer. We've had some high-profile examples, such as the AOL search-data fiasco and the ongoing brouhaha over Netflix data. Paul Ohm's working paper on the topic is a great way to get up to speed.

We librarians are fairly dogmatic about this sort of thing, owing to our professional-ethics commitment to your freedom to read. We wipe your checkout record clean after you turn your items back in. We do keep passive-voice usage records on our materials: "this book has been checked out X times since Y date." But that's it. (And no, we don't keep track of when you visit the library, so it's not possible to connect a formerly checked-out book with you based on the date of checkout.)

This long-standing design decision is being challenged on social-media grounds; it's hard to build Web 2.0-ish applications around your library behavior if we don't keep records of your library behavior! I used to be on the Web 2.0 side of this particular controversy, but as I've been reading about reidentification, my mind has changed. Information about which local public library one goes to isn't precisely "zip code," but it's awfully, awfully close.

Anyway, the application to human-subjects data of all stripes is, I hope, obvious. It's not as simple as anonymizing data; even aggregating it and only permitting queries may not solve the problem. Certain data breakdowns (e.g. from survey data) may be problematic, because a narrow enough slice can describe only a handful of people.

Taking heed of the problem is the first step to solving it—but only the first. The sooner we have data-release guidelines that take reidentification into account, the happier I will feel about open data in the social sciences and medicine.

Incidentally, are you as sanguine about governments providing "linked data" as you were? Because I'm not.


Reading the articles you link to, I'm not sure it's POSSIBLE to have data release guidelines that take re-identification into account. It seems like any individual data is too much data.

Reading about this changes my thoughts on user data as well.

I forget what he's released publicly and what he hasn't, but Dave Pattern has developed some services based on "people taking this university course borrowed these books" -- no individual data is used (except to make the aggregations, of course), just aggregated course data. Using/releasing only aggregated data like that might be safe... but I'd want to talk to some very smart statisticians first, like the people who wrote those articles. And even then, what if they're wrong? One of the articles said Netflix did pass their original data release through some 'experts', who said it would be fine...

The basic "reidentification" attack is based on correlation. I don't recall anyone in college taking exactly the same set of courses I did in any given semester, so if I can find out what courses you're enrolled in, and also have access to the services you've described from Dave Pattern, I can probably figure out what you've personally been taking out of the library.

The gist of some recent papers is that it's easier to narrow down to individuals based on correlated data than a lot of people would intuitively imagine. (The Ohm paper discusses the correlation of ZIP code, sex, and birthdate. In another project at CMU, researchers found they could guess many individuals' Social Security numbers if they knew when the person was born and where they were from, especially if they got the supposedly "safe" last 4 digits of the SSN from some other source.)
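To make the correlation attack concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the records, the names, the `diagnosis` field, and the idea that the second list is some public roster carrying names. It joins an "anonymized" release to that roster on ZIP code, sex, and birthdate, and reports only the rows that match exactly one named person.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names stripped, quasi-identifiers kept.
released = [
    {"zip": "53703", "sex": "F", "birthdate": "1975-04-12", "diagnosis": "asthma"},
    {"zip": "53703", "sex": "M", "birthdate": "1980-09-30", "diagnosis": "diabetes"},
    {"zip": "53711", "sex": "F", "birthdate": "1975-04-12", "diagnosis": "migraine"},
]

# Hypothetical public roster that carries names alongside the same fields.
public = [
    {"name": "Alice Example", "zip": "53703", "sex": "F", "birthdate": "1975-04-12"},
    {"name": "Bob Example", "zip": "53703", "sex": "M", "birthdate": "1980-09-30"},
    {"name": "Carol Example", "zip": "53711", "sex": "F", "birthdate": "1975-04-12"},
    {"name": "Dana Example", "zip": "53711", "sex": "F", "birthdate": "1975-04-12"},
]

QUASI_IDENTIFIERS = ("zip", "sex", "birthdate")


def quasi_key(row):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(released_rows, public_rows):
    """Map each released row index to a name when exactly one person matches."""
    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[quasi_key(person)].append(person["name"])

    matches = {}
    for index, row in enumerate(released_rows):
        candidates = names_by_key.get(quasi_key(row), [])
        if len(candidates) == 1:  # a unique match is a definite reidentification
            matches[index] = candidates[0]
    return matches


reidentified = link(released, public)
print(reidentified)
```

In this toy data the first two rows are pinned to a single person each, while the third is shared by two people and stays ambiguous. Even that ambiguous row narrows the field to two candidates, which is the sort of partial narrowing the papers warn about.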

I like open data, as I've made clear elsewhere, but one does need to be aware that careless handling of "aggregate" user data can leak individual data that was never intended to be disclosed.

Given the countless arguments I've had on this topic (not so much with you), I am delighted to see you coming over to the side I've been on for...well, for always, ever since the FBI Library program in the 1960s. The benefits just aren't worth the dangers...and "let the user decide" doesn't work 'cause patrons by and large won't understand the extent of the dangers.

So, is using behavior tracking to support 'people who liked this also liked that' and similar features (as Netflix etc. do) _internally_, without ever publicizing the underlying data, similarly problematic? I'm not sure.

Of course, it's much harder to do for libraries if we can't share the data with each other. And I don't have much confidence in libraries to actually keep the underlying data secure and confidential.

Well, okay, let's say that what we're trying to prevent is the Furious Bureaucratic Interferers finding out that John Doe has read Bombs Away!

Based on aggregated information that includes John's checkout record, we display proudly that people who check out The King Is A Fink also check out Bombs Away! John's wishlist at Amazon, or his LibraryThing or his GoodReads collection contains the former but not the latter.

Would that stand up in court? ... Probably not, as long as there's enough information aggregated that John's unique reading habits don't finger him. Basically, John has to hope that a lot of other people read The King Is A Fink.

So, Jonathan, my somewhat-uneducated impression is that preference tracking is liable to lead to problems except on a pretty massive scale. I welcome better-educated takes on this; it seems to me pretty similar to the Netflix question.
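One way to picture the "massive scale" point is a minimum-support rule: only display a "people who checked out X also checked out Y" pairing when enough distinct patrons share it. The sketch below is an illustration, not anyone's actual system; the function name, the `min_patrons` threshold of 5, and the sample checkout histories are all assumptions.

```python
from collections import Counter
from itertools import combinations


def co_checkout_pairs(histories, min_patrons=5):
    """Count title pairs across patrons; keep pairs shared by >= min_patrons."""
    pair_counts = Counter()
    for titles in histories.values():
        # Each patron contributes at most once to any given pair.
        for first, second in combinations(sorted(set(titles)), 2):
            pair_counts[(first, second)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_patrons}


# Hypothetical checkout histories keyed by an internal patron id.
histories = {
    "p1": ["Gardening 101", "Cooking Basics"],
    "p2": ["Gardening 101", "Cooking Basics"],
    "p3": ["Gardening 101", "Cooking Basics"],
    "p4": ["Gardening 101", "Cooking Basics"],
    "p5": ["Gardening 101", "Cooking Basics"],
    "john": ["The King Is A Fink", "Bombs Away!"],
}

shown = co_checkout_pairs(histories)
print(shown)
```

Here John is the only patron behind the Fink/Bombs pairing, so it is suppressed, while the pairing shared by five patrons is displayed. A threshold like this reduces the risk but is no guarantee; someone who can watch the displayed pairs change over time, or who knows part of a patron's history, may still be able to infer more than intended.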

"let's say that what we're trying to prevent is the Furious Bureaucratic Interferers finding out that John Doe has read Bombs Away!"

One problem is that preventing them from "finding out" is harder than preventing them from proving it beyond a reasonable doubt in a court of law. If I can find out something *probable* about you that you don't care to publish, that in itself can be enough to give me undue leverage over you. I can use this information to choose you to investigate more closely, and as a lever to get information or cooperation from you or your peers (using techniques seen in many detective novels). I can use it as a way of guessing probable credentials to get me more information (that's what the CMU study I mentioned did with SSN correlations). If I'm in a position of authority, I can even put you on a terrorist watch list, or establish *probable* cause for a wiretap warrant, based on this information.

The degree of correlation you need for *probable* reidentification can be quite a bit less than for *definite* reidentification. Which is one reason that projects releasing aggregated personal information have to be very careful about it, to make sure one can't make mischief by probabilistic slicing and dicing.

And if secure handling of personal data is challenging even for conscientious data collectors, it's an even bigger problem when you're dealing with collectors who might not be so conscientious.

I've heard from some folks in small business about shenanigans of this sort with respect to health coverage. When insurers raise premiums on group health insurance, they're not allowed to identify particular people who are making claims of any particular sort. But apparently insurers are allowed to give aggregated risk information to a certain extent, and some get rather... detailed about it. To use a made-up example, if your insurer tells you "we're seeing higher risk in 30-35 year old females who work in IT", it wouldn't be hard in many companies to figure out that your premiums would go down if you could find a reason to fire Sue.
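The made-up Sue example can be checked mechanically: given a staff roster, count how many people fit the "aggregate" description. The sketch below uses an invented roster and a hypothetical helper, `matching`, purely to illustrate how a breakdown by age band, sex, and department can shrink to a single person.

```python
# Hypothetical roster for a small company; all names and values are invented.
roster = [
    {"name": "Sue", "age": 32, "sex": "F", "dept": "IT"},
    {"name": "Raj", "age": 34, "sex": "M", "dept": "IT"},
    {"name": "Mia", "age": 41, "sex": "F", "dept": "IT"},
    {"name": "Lena", "age": 31, "sex": "F", "dept": "Sales"},
]


def matching(people, min_age, max_age, sex, dept):
    """Return the names of everyone who fits the described risk group."""
    return [
        person["name"]
        for person in people
        if min_age <= person["age"] <= max_age
        and person["sex"] == sex
        and person["dept"] == dept
    ]


# "Higher risk in 30-35 year old females who work in IT"
candidates = matching(roster, 30, 35, "F", "IT")
print(candidates)
```

When the list comes back with one name, the "aggregate" statement is effectively a statement about that individual. With two or three names it is not a definite identification, but it may already be enough for the kind of probable guess described above.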

I've heard similar stories, John. Not pretty.

And your point about probable versus on-the-nose is right on.

Thanks D for this post. My take on it is still from the other side, however. If (like me) you feel the motivation of Open Data is sound, and that data for research SHOULD be available for verification, re-analysis and re-use, AND you think there is useful mileage in the Linked Data paradigm, how can we best deal with our sensitive, necessarily Less Open Data?

Part of the answer may be some form of expert disclosure analysis (not necessarily perfect, but even Closed Data doesn't give perfect protection, witness the various data loss scandals). However, if expert disclosure analysis is inside the pipeline (rather than at the design stage), you're about as far from Linked Open Data as you can get without being completely Closed.

By Chris Rusbridge on 14 Mar 2010

Well, I'm afraid you lost me at "data SHOULD be available." Seems to me that's the main question here, and I'm not at all sure the answer is "yes."