Making standards that work

By dsalo on November 2, 2009.

One phenomenon that will be—indeed, already is—utterly unavoidable in the data-curation space is the creation of standards. I once heard Andrew Pace say that standards are like toothbrushes: everybody thinks they're great, but nobody wants to use anybody else's.

Be that as it may, standards development and compliance are one way to make everybody's data play nicely with everybody else's data. It's not the only way, to be sure; one very important way that I'm sure we'll also see more of is Being The Only Game In Town. ICPSR manages this quite successfully, and so does the Digital Sky Survey. If you want to be important in the data spaces dominated by either of these large players, you play by their rules. It's just that simple.

When there's no big player to lay down the law, though, standards development becomes more attractive. How do you make a standard, then? More to the point, how do you make a good standard, a standard that works, a usable standard, a standard that will last?

I liked this blog post by Adam Bosworth about standards development very much. I think it captures much of the excellence that goes into successful standards as well as the dysfunction attending failed ones. I do want to add a fillip of my own, though, based on my own experience helping to build standards and trying to use standards built by other people.

When you're in a roomful of people tasked with building a standard, make sure the room contains representation from every group of people who will be asked or required to use it. That emphatically includes the non-technical and the non-specialist. It goes double or triple if the standard will affect existing technology installations: you must have someone in that standards room who uses the existing technology! No, a developer of the existing technology does not fulfill this requirement, because the distance between developers' understanding and users' understanding is often vast.

If the non-technical, non-specialist representative in the room can't understand the standard, it will fail. If that representative can't produce data that fit the standard, likewise. I agree with Bosworth's reservations about RDF; I myself have trouble understanding it and putting it to use, despite a decade's experience with markup, and I believe the tribulations folk like me face when trying to deal with it have slowed its adoption significantly.

When this rule about representation is flouted but standards are published anyway, the result is standards that fall apart under real-world use. I will adduce OAI-PMH as an example. It follows quite a few of Bosworth's recommendations: it's simple (I have explained it in twenty minutes to library-school students), largely human-readable, focused, precise about encodings, in possession of real implementations, and free on the web.

It is also flawed. Huge projects built on it have found its flaws impossible to bypass and expensive to work around (see Lagoze et al. 2006 for how NSDL ran aground on OAI-PMH's inadequacies).

The major flaw, to my mind, isn't difficult to explain or to understand: OAI-PMH has no error reporting built in for the metadata it carries. Its defined error conditions cover bad protocol requests, not bad records. In a protocol standard built for communication of and about metadata, nobody in the standards-design process ever seems to have asked the (to me) simple and obvious question, "What happens if the metadata is malformed or otherwise wrong?"
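
To make the gap concrete, here is a minimal sketch of the sort of checking a harvester has to bolt on for itself. Everything in it is an assumption for illustration: the function names `check_record` and `build_report`, the required-field list, the YYYY-MM-DD date rule, and the sample records are all hypothetical, and none of it is defined by OAI-PMH. It only shows what collecting metadata problems into a report might look like; getting that report back to the data provider is the part this post argues the standard leaves out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace of the simple Dublin Core element set.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical house rules; OAI-PMH does not define any of these checks.
REQUIRED_FIELDS = ("title", "identifier")


def check_record(xml_text):
    """Return a list of human-readable problems found in one metadata record."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed XML: nothing else can be checked.
        return [f"not well-formed XML: {exc}"]

    problems = []
    for field in REQUIRED_FIELDS:
        values = [el.text for el in root.iter(DC + field)]
        if not any(v and v.strip() for v in values):
            problems.append(f"missing or empty dc:{field}")

    # Assumed local rule: every date must be a full YYYY-MM-DD date.
    for el in root.iter(DC + "date"):
        text = (el.text or "").strip()
        try:
            datetime.strptime(text, "%Y-%m-%d")
        except ValueError:
            problems.append(f"dc:date not in YYYY-MM-DD form: {text!r}")
    return problems


def build_report(records):
    """Map record identifiers to their problems, skipping clean records."""
    report = {}
    for record_id, xml_text in records.items():
        problems = check_record(xml_text)
        if problems:
            report[record_id] = problems
    return report


# Made-up sample records, one clean and one with three problems.
GOOD = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Field notes</dc:title>"
    "<dc:identifier>oai:example.org:1</dc:identifier>"
    "<dc:date>2009-11-02</dc:date>"
    "</record>"
)
BAD = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title> </dc:title>"
    "<dc:date>Nov. 2009</dc:date>"
    "</record>"
)

if __name__ == "__main__":
    for record_id, problems in build_report({"good": GOOD, "bad": BAD}).items():
        print(record_id, problems)
```

The report here is just a dictionary. What a real service would do with it (email the repository manager, expose it at a URL, log it) is a local decision precisely because there is no shared convention to follow.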

Anyone who's worked on the ground with repositories of any stripe knows that metadata problems, sometimes gross problems, are par for the course. For that matter, any librarian can explain the pitfalls of metadata and citation creation at great length. I honestly can't tell you why OAI doesn't seem to have on-the-ground repository managers and other librarians capable of raising such practical issues working on its standards bodies.

I can, however, tell you that they should. The latest OAI development, OAI-ORE, contains exactly the same no-error-reporting weakness I just pointed out in OAI-PMH. Yes, some of the underlying technologies that OAI-ORE is built on contain certain kinds of error reporting, but taken together, the errors those technologies can report are only a subset of the errors that I believe will crop up.

To make standards that work, include people on the standard-design team who work with the processes underlying the standard. Now that you know this—go forth and standardize!

Thanks Dorothea for another nice post. My main comment will relate to some of your other discussions (and my main interest) on data curation. But in passing I'll comment that an earlier standard for metadata sharing - Z39.50 - while almost certainly built with librarian involvement, was also significantly flawed, resulting in "virtual union catalogues" (I'm sorry I coined the term "virtual clumps", for shame!) that frequently failed with known item searches (because of multiple levels of mapping). My take here is that making standards that really work is just plain hard. The IETF "rough consensus and working code" approach looks the best, but there are still plenty of unsuccessful IETF RFCs.

I had not seen the Lagoze article you pointed to, for which thanks. It's doubly interesting as Lagoze was one of the main participants in creating OAI-PMH. I particularly liked this bit:

"But more problematic was the reality that the personnel requirements to share metadata were deceptively high due to what can be characterized as a 'knowledge gap'. Successful provision of metadata actually involves three distinct skill sets:

1. Domain expertise - knowledge of the resources themselves and their pedagogical goal.
2. Metadata expertise - knowledge of cataloging practices such as use of controlled vocabularies and proper formatting of data such as names and dates.
3. Technical expertise - knowledge of tools involved in setting up and running an OAI-PMH server including XML, XML schema, UTF8, and HTTP.

We found that very few NSDL collections had a single person, let alone a team, with these three skill sets."

I think that combination, in slightly different form, could describe what's needed for building a workable data repository/data curation service in an institution. It clearly takes more than library skills, more than IT skills, and more than domain science skills; it somehow takes all of them combined. The problem is that the library and IT skills can be acquired and shared in a scalable way, but how do we make the domain skills scalable?

Oh, Z39.50 is a mess, and so is OpenURL. Librarian involvement is not a panacea; it's just that when they're the major users of a standard, it's counterproductive not to find a good one and include him or her.

Stating the obvious with regard to domain expertise: researchers have it! Is more truly necessary?
