Data Production and Release: We Need to Change Incentives

ScienceBlogling Revere calls for an open data policy for federally-funded research (italics mine):

We've inveighed often here about the shameful practice that many senior and well-respected flu scientists have of keeping their sequences private until they publish -- if they publish using them. If not, no one gets to see them, even if we paid with tax money to collect them. The motives are often unselfish -- a senior scientist trying to protect post-docs or grad students from being scooped. Very Old School. This is the 21st century. We have our own students and we take mentoring very seriously. And one of the things we teach them is that if they have information of importance to public health, then it is to be made public. You don't make any deals with anyone that you will keep it confidential. Period. And you don't keep hold of it on your own initiative, either. Influenza virus sequences are matters of public health importance. If you are worried your career or the career of your students or post docs will be harmed by releasing them as soon as practicable, then you are in the wrong field. Choose a field or a virus where it doesn't matter. But keeping those sequences private is part of the "culture of the discipline." And it needs to change.

I've discussed this before, and I agree that data needs to be free. But what we need to ensure is that the incentive structure for scientists accurately reflects this policy. In the real world, scientists are typically not rewarded for producing data--not directly anyway*. We are rewarded for producing publications, which lead to funding, which in turn leads to more publications, and so on**. For most academic researchers, the publication is vital for career reasons.

Revere offers a solution I like (italics mine):

As an epidemiologist it can take me years of hard work to collect data. I want to use that data and reap its benefits, both for public health and for me personally and my students and post docs. That doesn't mean I get to hoard them. It means that I have to use them in a timely way. I have an advantage over everyone else because I know the data better than they do and I have it before they do. But I don't have any ownership rights over it. If someone else can use my work, that's what science is all about. Making it available and accessible should be part of the culture of my discipline. It isn't, sad to say. But what should also be part of the culture is that if I use someone else's data (or vice versa), data made accessible to me by virtue of a granting agency's policies, I should give full credit to those who collected it and that credit should count in terms of academic appointments and promotions.

Exactly! Data generators should be included as co-authors, at least within a certain time period after data release--the length of that window might have to be discipline-dependent. We have to create incentives, or at least remove disincentives, for sharing data more rapidly. Many fields would benefit from this.

*I work at an institution where our funding is primarily predicated on data generation, not publication (although publication is good!).

**What is this 'teaching' thing you speak of?[/snark]

I fully agree. In linguistics, we use what are known as corpora. (Computers are finally able to let us dive into vast quantities of language and not go insane.) The major corpora of English are the BNC and COCA, both of which are available freely online.

If such corpora were not freely and easily available, linguistics would surely be suffering.