Avoiding roach motels

The latest issue of the International Journal of Digital Curation is out; if you're in this space and not at least watching the RSS feed for this journal, you should be.

I was scanning this article on Georgia Tech's libraries' development of a data-curation program when I ran across a real jaw-dropper:

One of the bioscientists asked the data storage firm used by one of the labs recently about the costs associated with accessing data from studies conducted a few years ago. The company replied, "you wouldn't want to pay us to do that. It would be less expensive to re-run your experiments." (p. 88)

Ouch. The immediate question springing to mind is "why is this lab paying these people to store data if the stored data then become unretrievable by the original depositors?" Roach motel: data goes in, but it doesn't come out!

It seems to me that the lesson here is to make even seemingly obvious requirements explicit when an expensive service is in play. You would think that retrieval is an automatic concomitant of storage. I sure would. Apparently not!

I've run into similar problems before, but in the best story I have on the subject, the problem was format-related. I retell the story in order to warn people to be wary of hotshot black-box content-management systems.

I once worked for a scholarly-publishing service bureau. The company did editorial work, typesetting, art, design, and SGML/XML-based workflows; that last was the division I was in. One of our SGML clients was shopping for a content-management system to manage and archive all their publishing material. Sensible enough. They specifically asked each vendor whether they could retrieve the same SGML from the system that they'd put into it.

The vendor they eventually contracted with assured them that they could. Not to put too fine a point on it, the vendor was telling untruths. The SGML was munged on ingest into whatever unholy lossy proprietary mess the CMS used natively, and could not be retrieved intact therefrom. Our client didn't find this out until after the purchase, of course. There was talk of lawsuits; I don't recall where that went.

Slight happy ending: our shop had its own project-archiving procedures, so the client didn't lose any SGML that we had provided them with.

Don't let any of this happen to you! Ask questions that seem stupid, and make your counterpart commit to the answers you want.


I recently heard a similar comment while interviewing a senior bioinformatics researcher about data needs. His argument was that, as long as one still had the sample, the continual increases in the speed and resolution of the instruments meant that re-analysing would lead to better data. Of course, this still leaves the 'integrity of the scholarly record' issue, but it is clear that for some disciplines the store vs. re-analyse issue is not as clear cut as one might think.

Right, some sciences actually prefer regeneration of data to archiving it, and if that's what floats their boat, it's fine with me.

I'm just appalled that a lab went to the trouble to be responsible with their data and then couldn't get it back because the storage provider held it to ransom. That's scary.

This is going to become a problem of ever-increasing frequency over time. It's not just the top-level formatting (CMS) one needs to be concerned with but also the media it is written to and the formatting of that media. I have some micro tapes sitting in my desk here that I might as well throw away, as there are no drivers existing anymore to run the proprietary tape machine that recorded them (it's my old FidoNet BBS, so I only hang on to them for sentimental reasons). Worse still are programs written in languages that are no longer in use; in a few years no one will have a clue how to get at the data. Then there's the problem of the actual composition of the media it is written to, much of which is already starting to deteriorate. As bad as it is for the environment, there are good things to be said about printing everything off on acid-free paper and storing it :)