Avoiding roach motels

By dsalo on December 8, 2009.

The latest issue of the International Journal of Digital Curation is out; if you're in this space and not at least watching the RSS feed for this journal, you should be.

I was scanning this article on Georgia Tech's libraries' development of a data-curation program when I ran across a real jaw-dropper:

One of the bioscientists asked the data storage firm used by one of the labs recently about the costs associated with accessing data from studies conducted a few years ago. The company replied, "you wouldn't want to pay us to do that. It would be less expensive to re-run your experiments." (p. 88)

Ouch. The immediate question springing to mind is "why is this lab paying these people to store data if the stored data then become unretrievable by the original depositors?" Roach motel: data goes in, but it doesn't come out!

It seems to me that the lesson here is to make even seemingly obvious requirements explicit when expensive service provision is in play. You would think that retrieval is an automatic concomitant of storage. I sure would. Apparently not!

I've run into similar problems before, but in the best story I have on the subject, the problem was format-related. I retell the story in order to warn people to be wary of hotshot black-box content-management systems.

I once worked for a scholarly-publishing service bureau. The company did editorial work, typesetting, art, design, and SGML/XML-based workflows; I was in that last division. One of our SGML clients was shopping for a content-management system to manage and archive all their publishing material. Sensible enough. They specifically asked each vendor whether they could retrieve the same SGML from this system that they'd put into it.

The vendor they eventually contracted with assured them that they could. Not to put too fine a point on it, the vendor was telling untruths. The SGML was munged on ingest into whatever unholy lossy proprietary mess the CMS used natively, and could not be retrieved intact therefrom. Our client didn't find this out until after the purchase, of course. There was talk of lawsuits; I don't recall where that went.

Slight happy ending: our shop had its own project-archiving procedures, so the client didn't lose any SGML that we had provided them with.

Don't let any of this happen to you! Ask questions that seem stupid, and make your counterpart commit to the answers you want.
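One way to back up the "stupid question" with evidence is a round-trip test during the vendor trial: record a checksum of each file before deposit, retrieve it, and compare. What follows is only a minimal sketch in Python, assuming you can get the deposited and retrieved files as raw bytes; the function names and the sample SGML snippet are made up for illustration and do not come from any particular CMS.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def round_trip_intact(deposited: bytes, retrieved: bytes) -> bool:
    """True only if what came out is byte-for-byte what went in."""
    return sha256_of(deposited) == sha256_of(retrieved)


# Hypothetical sample: a tiny SGML document, and a copy "munged" on ingest.
original = b'<!DOCTYPE article SYSTEM "article.dtd">\n<article><title>Test</title></article>\n'
munged = original.replace(b"<title>", b"<TITLE>")

print(round_trip_intact(original, original))  # True
print(round_trip_intact(original, munged))    # False
```

A byte-for-byte comparison is deliberately strict: it will flag harmless changes such as line-ending conversion along with real damage. That strictness is useful in a trial, because it forces the vendor to explain every difference before you sign anything.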

I recently heard a similar comment while interviewing a senior bioinformatics researcher about data needs. His argument was that, as long as one still had the sample, the continual increases in the speed and resolution of the instruments meant that re-analysing would lead to better data. Of course, this still leaves the 'integrity of the scholarly record' issue, but it is clear that for some disciplines the store vs. re-analyse issue is not as clear cut as one might think.

Right, some sciences actually prefer regeneration of data to archiving it, and if that's what floats their boat, it's fine with me.

I'm just appalled that a lab went to the trouble to be responsible with their data and then couldn't get it back because the storage provider held it to ransom. That's scary.

This is going to become a problem of ever-increasing frequency over time. It's not just the top-level formatting (CMS) one needs to be concerned with but also the media it is written to and the formatting of that media. I have some micro tapes sitting in my desk here that I might as well throw away, as there are no drivers existing anymore to run the proprietary tape machine that recorded them (it's my old FidoNet BBS, so I only hang on to them for sentimental reasons). Worse still are programs written in languages that are no longer in use; in a few years no one will have a clue how to get at the data. Then there's the problem of the actual composition of the media it is written to, much of which is already starting to deteriorate. As bad as it is for the environment, there are good things to be said about printing everything off on acid-free paper and storing it :)


