One phenomenon that will be—indeed, already is—utterly unavoidable in the data-curation space is the creation of standards. I once heard Andrew Pace say that standards are like toothbrushes: everybody thinks they're great, but nobody wants to use anybody else's.
Be that as it may, standards development and compliance is one way to make everybody's data play nicely with everybody else's data. It's not the only way, to be sure; one very important way that I'm sure we'll also see more of is Being The Only Game In Town. ICPSR manages this quite successfully, and so does the Digital Sky Survey. If you want to be important in the data spaces dominated by either of these large players, you play by their rules, just that simple.
When there's no big player to lay down the law, though, standards development becomes more attractive. How do you make a standard, then? More to the point, how do you make a good standard, a standard that works, a usable standard, a standard that will last?
I liked this blog post by Adam Bosworth about standards development very much. I think it captures much of the excellence that goes into successful standards as well as the dysfunction attending failed ones. I do want to add a fillip of my own, though, based on my own experience helping to build standards and trying to use standards built by other people.
When you're in a roomful of people tasked with building a standard, make sure the room contains representation from every group of people who will be asked or required to use it. That emphatically includes the non-technical and the non-specialist. It goes double or triple if the standard will affect existing technology installations: you must have someone in that standards room who uses the existing technology! No, a developer of the existing technology does not fulfill this requirement, because the distance between developers' understanding and users' understanding is often vast.
If the non-technical, non-specialist representative in the room can't understand the standard, it will fail. If that representative can't produce data that fit the standard, likewise. I agree with Bosworth's reservations about RDF; I myself have trouble understanding it and putting it to use, despite a decade's experience with markup, and I believe the tribulations such folk as I face when trying to deal with it have retarded its adoption significantly.
When this rule about representation is flouted but standards are published anyway, the result is standards that fall apart under real-world use. I will adduce OAI-PMH as an example. It follows quite a few of Bosworth's recommendations: it's simple (I have explained it in twenty minutes to library-school students), largely human-readable, focused, precise about encodings, in possession of real implementations, and free on the web.
It is also flawed. Huge projects built on it have found its flaws impossible to bypass and expensive to work around (see Lagoze et al. 2006 for how NSDL ran aground on OAI-PMH's inadequacies).
The major flaw, to my mind, isn't difficult to explain or to understand: OAI-PMH has no built-in way to report errors in the metadata itself. In a protocol standard built for communication of and about metadata, nobody in the standards-design process ever seems to have asked the (to me) simple and obvious question, "What happens if the metadata is malformed or otherwise wrong?"
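To make the gap concrete, here is a minimal Python sketch of the kind of check a harvester ends up writing for itself. Everything specific in it is an assumption for illustration: the sample record, the function name find_metadata_problems, and the two rules it applies (a non-empty title, a date that looks like YYYY, YYYY-MM, or YYYY-MM-DD) are mine, not anything OAI-PMH defines. The protocol does define error codes for bad requests, such as badVerb and badArgument; what it lacks is a channel for sending a list like this one back to the data provider.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the unqualified Dublin Core elements used in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical local rule: accept YYYY, YYYY-MM, or YYYY-MM-DD only.
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def find_metadata_problems(record_xml):
    """Return a list of human-readable problems found in one record."""
    try:
        root = ET.fromstring(record_xml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []

    # A record with no title, or only blank titles, is flagged.
    titles = [el.text for el in root.iter(DC + "title")]
    if not any(t and t.strip() for t in titles):
        problems.append("missing or empty dc:title")

    # Every date must match the (assumed) local date rule.
    for el in root.iter(DC + "date"):
        value = (el.text or "").strip()
        if not DATE_PATTERN.match(value):
            problems.append(f"unparseable dc:date: {value!r}")

    return problems


# Illustrative record, invented for this sketch.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title></dc:title>
  <dc:date>Spring 2006</dc:date>
</oai_dc:dc>"""

for problem in find_metadata_problems(SAMPLE):
    print(problem)
```

The harvester now knows what is wrong with the record, but the only way to tell the provider is out of band: an email, a phone call, a spreadsheet.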
Anyone who's worked on the ground with repositories of any stripe knows that metadata problems, sometimes gross problems, are par for the course. For that matter, any librarian can explain the pitfalls of metadata and citation creation at great length. I honestly can't tell you why OAI doesn't seem to have on-the-ground repository managers and other librarians capable of raising such practical issues working on its standards bodies.
I can, however, tell you that they should. The latest OAI development, OAI-ORE, contains exactly the same no-error-reporting weakness I just pointed out in OAI-PMH. Yes, some of the underlying technologies that OAI-ORE is built on contain certain kinds of error reporting, but taken together, the errors they can report are only a subset of the errors that I believe will crop up.
To make standards that work, include people on the standard-design team who work with the processes underlying the standard. Now that you know this—go forth and standardize!
Thanks, Dorothea, for another nice post. My main comment will relate to some of your other discussions (and my main interest) on data curation. But in passing I'll comment that an earlier standard for metadata sharing - Z39.50 - while almost certainly built with librarian involvement, was also significantly flawed, resulting in "virtual union catalogues" (I'm sorry I coined the term "virtual clumps", for shame!) that frequently failed on known-item searches (because of multiple levels of mapping). My take here is that making standards that really work is just plain hard. The IETF "rough consensus and working code" approach looks the best, but there are still plenty of unsuccessful IETF RFCs.
I had not seen the Lagoze article you pointed to, for which thanks. It's doubly interesting as Lagoze was one of the main participants in creating OAI-PMH. I particularly liked this bit:
"But more problematic was the reality that the personnel requirements to share metadata were deceptively high due to what can be characterized as a 'knowledge gap'. Successful provision of metadata actually involves three distinct skill sets:
1. Domain expertise - knowledge of the resources themselves and their pedagogical goal.
2. Metadata expertise - knowledge of cataloging practices such as use of controlled vocabularies and proper formatting of data such as names and dates.
3. Technical expertise - knowledge of tools involved in setting up and running an OAI-PMH server including XML, XML schema, UTF8, and HTTP.
We found that very few NSDL collections had a single person, let alone a team, with these three skill sets."
I think that combination in slightly different form could describe what's needed for building a workable data repository/data curation service in an institution. It's clearly more than library skills, more than IT skill, and more than domain science skills, but somehow combining all of them. The problem is that the library and IT skills can be acquired and shared in a scalable way, but how do we make the domain skills scalable?
Oh, Z39.50 is a mess, and so is OpenURL. Librarian involvement is not a panacea; it's just that when librarians are the major users of a standard, it's counterproductive not to find a good one and include him or her.
Stating the obvious with regard to domain expertise: researchers have it! Is more truly necessary?