Peer review, data quality, and usage metrics

By dsalo on November 24, 2009.

Another case of things connecting up oddly in my head—

"How do we know whether a dataset is any good?" is a vexed question in this space. Because the academy is accustomed to answering quality questions with peer review, peer review is sometimes adduced as part of the solution for data as well.

As Gideon Burton trenchantly points out, peer review isn't all it's cracked up to be, viewed strictly from the quality-metering point of view. It's known to be biased along various axes, fares poorly on consistency metrics, is game-able and gamed (more by reviewers than reviewees, but even so), and favors the intellectual status quo.

More pragmatically, it is also overwhelmed, and that's just considering the publishing system. I recently had a journal editor apologize for sending me a second article to review in the space of roughly a couple of months. At the risk of being deluged with review requests, I will say that the apology surprised me, because I don't do much article reviewing and was happy to take on another review—but the fact of the apology will serve as an indicator of a severely overburdened system.

We can't add data to the peer-review process. We can't even manage publishing with the peer-review process, it begins to seem. So where does that leave us?

Well, to start with, consider the difficulty of knowing how many people have read a print journal article. Library privacy policies aside, counting such things as copies of journals left to be reshelved offers no usage data whatever on the level of the individual article. (This, of course, is one of the fatal flaws of impact-factor measurements as they are currently conducted.) So we have contented ourselves with various proxies: subscriptions as a proxy for individual readership, library reshelvings as a proxy for use, citations as a proxy for influence (which is somewhat more defensible, at least on the individual article level, but not without its own inadequacies), and so on.

Proxies. Heuristics. Because we can't get at the information we really want: how much does this article matter to the progress of knowledge?

Let me advance the notion that for digital data, especially open data, the proof of the pudding may actually be in the eating.

How do we know ab initio that a dataset is accurately collected, useful, and untainted by fraud? Well, we don't. But datasets when used have a habit of exposing their own inadequacies, if any. I know, for example, that Google Maps has a dubious notion of Milwaukee freeways because I once nearly missed my exit when Google Maps erroneously said it was a left exit. Judgment through use and experience.

I believe there is also a curious and potentially useful asymmetry between how publications and data are used. If I disagree with an article in something I write, I still have to cite the article I disagree with. If I see a bad dataset, I don't have to cite it; I'm far more likely simply to disregard it and use data I do believe in. (This is probably an oversimplification; I can also try to discredit the dataset, or perhaps collect my own, better one. But I suspect the default reaction to faulty data will turn out to be ignoring it.)

Likewise, data may improve through use and feedback, as in many fields they are less fixed than publications. "I would like to mash up my data with yours, but I'm missing one crucial variable," may not be an insuperable difficulty! We can even see this winnowing function in action, as various governments and major newspapers start to release data and respond to critiques and requests. Even in libraries this process is ongoing, as we confront the mismatches between our current data standards and practices and the uses we wish to make of them.

If I am right, data usage metrics and citation standards for data take on new importance. How often a dataset has been directly used may turn out to be a far more useful heuristic for judging its quality than analogous heuristics in publishing have been… and best of all, if we manage citation with any agility whatever (a big if, I grant), use is a passively gathered heuristic from the point of view of researchers, unlike peer review.

Elegant. I hope this is right. Time will tell, as always.


One of those articles on sharing data in biodiversity reported that for the smaller datasets, a criterion for selection and use was the data gatherer's "commitment to the organism". I thought that was pretty interesting. Yet another time a social aspect trumps some of the "objective" metrics. Maybe there needs to be more than usage, citation, and standard metadata.
