Another case of things connecting up oddly in my head—
"How do we know whether a dataset is any good?" is a vexed question in this space. Because the academy is accustomed to answering quality questions with peer review, peer review is sometimes adduced as part of the solution for data as well.
As Gideon Burton trenchantly points out, peer review isn't all it's cracked up to be, viewed strictly from the quality-metering point of view. It's known to be biased along various axes, fares poorly on consistency metrics, is game-able and gamed (more by reviewers than reviewees, but even so), and favors the intellectual status quo.
More pragmatically, it is also overwhelmed, and that's just considering the publishing system. I recently had a journal editor apologize for sending me a second article to review in the space of roughly a couple of months. At the risk of being deluged with review requests, I will say that the apology surprised me, because I don't do much article reviewing and was happy to take on another review—but the fact of the apology will serve as an indicator of a severely overburdened system.
We can't add data to the peer-review process. We can't even manage publishing with the peer-review process, it begins to seem. So where does that leave us?
Well, to start with, consider the difficulty of knowing how many people have read a print journal article. Library privacy policies aside, counting such things as copies of journals left to be reshelved offers no usage data whatever on the level of the individual article. (This, of course, is one of the fatal flaws of impact-factor measurements as they are currently conducted.) So we have contented ourselves with various proxies: subscriptions as a proxy for individual readership, library reshelvings as a proxy for use, citations as a proxy for influence (which is somewhat more defensible, at least on the individual article level, but not without its own inadequacies), and so on.
Proxies. Heuristics. Because we can't get at the information we really want: how much does this article matter to the progress of knowledge?
Let me advance the notion that for digital data, especially open data, the proof of the pudding may actually be in the eating.
How do we know ab initio that a dataset is accurately collected, useful, and untainted by fraud? Well, we don't. But datasets when used have a habit of exposing their own inadequacies, if any. I know, for example, that Google Maps has a dubious notion of Milwaukee freeways because I once nearly missed my exit when Google Maps erroneously said it was a left exit. Judgment through use and experience.
I believe there is also a curious and potentially useful asymmetry between how publications and data are used. If I disagree with an article, I still have to cite it in my own article. If I see a bad dataset, I don't have to cite it; I'm far more likely simply to disregard it and use data I do believe in. (This is probably an oversimplification; I can also try to discredit the dataset, or perhaps collect my own, better one. But I suspect the default reaction to faulty data will turn out to be ignoring it.)
Likewise, data may improve through use and feedback, as in many fields they are less fixed than publications. "I would like to mash up my data with yours, but I'm missing one crucial variable," may not be an insuperable difficulty! We can even see this winnowing function in action, as various governments and major newspapers start to release data and respond to critiques and requests. Even in libraries this process is ongoing, as we confront the mismatches between our current data standards and practices and the uses we wish to make of them.
If I am right, data usage metrics and citation standards for data take on new importance. How often a dataset has been directly used may turn out to be a far more useful heuristic for judging its quality than analogous heuristics in publishing have been… and best of all, if we manage citation with any agility whatever (a big if, I grant), use is a passively gathered heuristic: unlike peer review, it asks no extra work of researchers.
Elegant. I hope this is right. Time will tell, as always.
One of those articles on sharing data in biodiversity reported that for the smaller datasets, a criterion for selection and use was the data gatherer's "commitment to the organism". I thought that was pretty interesting. Yet another case where a social aspect trumps some of the "objective" metrics. Maybe there needs to be more than usage, citation, and standard metadata.