A scientific ethics of code

I'm a scientist and my research is supported by NIH, i.e., by American taxpayers. More importantly, the science I do is for anyone to use. I claim no proprietary rights. That's what science is all about. We make our computer code publicly available, not just by request but posted on the internet, and it is usable code: commented and documented. We ask the scientists in our program to do the same with the reagents they develop. Reagents are things like genetic probes or antibodies directed against specific targets mentioned in the articles they publish. There is a list of the reagents on the internet, with instructions on how to get them if you are another researcher. Since giving you the link would also reveal the identity of one of the reveres, you'll just have to trust me that this is true. It is. And I mention it because I am in full agreement with a piece in The Guardian [UK] by Darrell Ince, a professor of computing at The Open University in the UK (hat tip Slashdot):

One of the spinoffs from the emails and documents that were leaked from the Climate Research Unit at the University of East Anglia is the light that was shone on the role of program code in climate research. There is a particularly revealing set of "README" documents that were produced by a programmer at UEA apparently known as "Harry". The documents indicate someone struggling with undocumented, baroque code and missing data - this, in something which forms part of one of the three major climate databases used by researchers throughout the world.

Many climate scientists have refused to publish their computer programs. I suggest that this is both unscientific behaviour and, equally importantly, ignores a major problem: that scientific software has got a poor reputation for error. (Darrell Ince, The Guardian)

I do not have a moment's doubt about the basic science of climate change. There are too many convergent lines of evidence and some really convincing science to back it up. But like a lot of science -- including a great deal of molecular biology and sophisticated engineering and much else -- it depends on complex computer code that can't be checked or verified because it isn't made available to other scientists.

One of the things we know about software -- even critical software that runs important medical devices like radiation therapy machines -- is that it is frequently in error. Checks of commercially produced software have found a high rate of error and inconsistency. Imagine what you'd find in software produced by academic researchers who aren't software engineers. The trouble is, you often can't find out, because the code isn't always made available.

I feel the same way about data. As an epidemiologist I face some problems related to subject privacy with our data sets, but they can be overcome. Many of my colleagues object to releasing their data sets for a different reason: usually it has taken them years and a great deal of money to collect the data, and they don't want someone else scarfing it up without lifting a finger and using it to scoop them. My colleagues -- and I -- want first crack at it. The same thing is true for sequence data in virology and other disciplines. I'm sympathetic because I'm in the same boat, but I think this can be dealt with, too. One way would be to grant a grace period before requiring release, to allow the scientist who collected the data to use it first. Once published, the data must be made available, preferably as online supplementary material accompanying the research where it is used. Another solution would be some requirement for crediting the data collector via authorship or data-origination credit, credit that would count for academic or professional purposes like promotion and tenure.

Whatever the solution, the principle should be that scientific data, like other information, wants to be free, and it has an even stronger claim because science is an open process. Science can't be open if the tools that generate the data, and the data themselves, are not accessible for confirmation or verification. I agree with Ince:

So, if you are publishing research articles that use computer programs, if you want to claim that you are engaging in science, the programs are in your possession and you will not release them then I would not regard you as a scientist; I would also regard any papers based on the software as null and void.

We now use only open source statistical software like R, because it can be checked, improved and corrected by a large community of users and by our scientific colleagues around the world. We make our own code available, too.

Because we like to consider ourselves scientists.

Well said. It seems absurd that this is even a matter for debate, but there you go.

Excellent post! I have sometimes found myself guilty of not releasing software, unfortunately. This isn't because I wish to keep it proprietary, mind you. It's only because it takes time and effort to put the code into a releasable form (due to sometimes sloppy coding).

Moving forward, it would definitely be better scientific practice to keep my code in reasonably publishable form all the way through, and to publish it online.

By Jason Dick (not verified) on 10 Feb 2010 #permalink

I can but agree with you: I'm trying to make my code and data available (code is easy, as it's mine, but data comes from other people). But there can be practical problems in, for example, curating code and data. Ideally both should be understandable and usable "up front", which means storing them in a format that can be easily read by a large variety of packages (i.e. not in Excel and Word). This means some standards are needed, however loose.

Incidentally, are you aware that R can read data directly from web pages? I mention this because it allows both code and data to be put online so that they can be read and used easily: just source() the R code file and let it read the data. This is how Web 2.0 should be working.
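For example, a minimal sketch (the URLs and column name are made-up placeholders):

    # Read a CSV data file straight off a web server:
    temps <- read.csv("http://example.org/station_temps.csv")
    summary(temps$annual_mean)   # hypothetical column name

    # Or run a published analysis script, which can itself
    # fetch its data from a URL in the same way:
    source("http://example.org/analysis.R")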

What license, if any, do you publish it under? Might I recommend AGPLv3 (Affero General Public License version 3) as it maintains end-user rights (and perhaps data recipients' rights, depending on how things go downstream)? :)

Joseph: We are using GPL and Apache for code, and CC for written material.

Great! :)

Ironic! I'm sorta-working with a group that has taken the publicly-available NASA GISS code and translated it from FORTRAN to Python (click my name for the website). The plan is to complete a direct translation, and then work on clarifying the code so that the steps can be documented and elaborated. We're also working on developing some visualizations for it so that the data can be sliced and diced any way a viewer could want.

The NASA GISS people have even expressed an interest in moving over from their FORTRAN code to our Python code!

Makes sense to me! I wanted to build on a model published in the last couple of years but couldn't replicate the original results. I agonized over it for a long time: enough information was published that I should, in principle, have been able to reproduce the original code, but I never succeeded in replicating the results (quite possibly my own fault - I'm no coding expert). The authors said they lost the code....

R is fantastic, and together with Sweave and LaTeX it permits the ultimate in literate statistical programming and transparent authoring of research papers. It is a crying shame that many medical journals are still so antediluvian as to demand MS Word, which positively invites cut-and-paste errors in tables and in-text figures.
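For anyone who hasn't seen it, here is a minimal sketch of what a Sweave file looks like (the data file and variable names are invented):

    \documentclass{article}
    \begin{document}

    We fit a linear model to the trial data:

    <<fit, echo=TRUE>>=
    d <- read.csv("trial_data.csv")      # hypothetical data file
    fit <- lm(outcome ~ dose, data = d)  # hypothetical variables
    summary(fit)$coefficients
    @

    The dose coefficient is \Sexpr{round(coef(fit)["dose"], 3)},
    recomputed from the raw data every time the paper is rebuilt.

    \end{document}

Running R CMD Sweave on the file executes the chunk and weaves code, output and prose into a single LaTeX source, so the numbers in the paper can never drift out of sync with the analysis.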

I thought molecular biology was mostly done in photoshop?

Ah, but as briefly alluded to at #3, sometimes the data is owned by someone else. This is particularly problematic in climate research, where a lot of data belongs to various meteorological services around the world. Very often these services are required by law (misguided law, for sure, but law all the same) to make a profit from their data, so they can't easily release it. Should researchers be barred from using such data? My guess is that such a requirement would have been a great hindrance to climate science. On the other hand, it would have made it possible to put a great deal of pressure on these governments to make the data available.

Harald: There are a number of gray areas and difficulties, more, probably, in my own field of epidemiology than in most fields. But the general principle should be fought for. If you publish something that no one can check, you're asking a lot, essentially just saying "trust me." So I'm not asking for the past to change but the future. There are some kinds of studies I'd like to do but can't any more, because they are considered unethical or journals require me to divulge a conflict of interest or whatever. That's the science world we should move toward, and while we can explain the world we used to live in, it's not that relevant. And that world has produced all sorts of problems, which aren't false problems but real ones.

Revere, is the problem mentioned by Harald Hanche-Olsen similar to the problem of gene patenting, where only a few people are privy to the knowledge of the invention?

@perceval: Wow, I thought I was the only user of R/Sweave/LaTeX who wasn't also a developer of Sweave. It's nice to see more use of literate programming techniques outside traditional programming.

Numerical programming seems to be a fading art. Releasing scientific source code is low risk: the few people interested in the code are usually competent enough to improve it, or, if not, have a good source of material to learn from.

Sadly, a lot of radiation codes are rotting behind the paywall of RSICC at ORNL. Licensing fees are obscene and the move to personal licensing & liability for export control is egregious. Nothing against RSICC or its mission, but it could be replaced with a Subversion server & Trac in a month, cheaper, faster, and more secure than the 80s-era mindset and toolset they're mired in now. It's probably easier to rewrite ORIGEN from scratch than get RSICC/ORNL to change.

Alex: No. Since you are in math, it would be like publishing the four color problem with a general description and a note saying the computer program indicated four colors suffice, but not giving anyone the program code.

Revere, I am sorry, but you are buying into the denialists' claims about the lack of sharing - most computer code and data are freely available, if you spend just a short time looking for them.

See RealClimate's post on the subject (which links to a page they have created containing links to all the data and code available). Most of that stuff has been available all along, and still the denialists have demanded access to it.

Kristjian: The post wasn't about climate science. It was about computer code being available. I made no claims. I did use Ince's pull quote, which was only obliquely about the availability of the climate code and more about the state of the code itself. I think the thrust of the post is quite clear, and my views on climate science are clear, too.

Can someone please score scientists and papers by their degree of secrecy and make the list available on the web? Papers with unpublished data should get a mark, so they can be easily filtered with search engines.

If someone publishes the data from a previously published paper, where the data are already explained, does that count as a new paper? It would give rise to a new job of data-hunting and revealing.

Darrell Ince concludes that if a code isn't released, he regards "... any papers based on the software as null and void."

This is frankly a load of nonsense. For most of the history of scientific computing in physics, most research papers did not publish code. Nor was there much demand for anyone to release code. The reason is that anyone who wanted to reproduce the results was expected to write their own *independent* computer code. This provides a far better check than having another scientist use and/or attempt to check the *same* piece of code. Anyone who has attempted to understand computer code written by another scientist can easily verify that it is almost always easier to simply write a new code, which also has the virtue of being completely independent.

This brings me to a crucial point that Ince overlooks: whilst scientists often do not publish or release the actual code, it is absolutely essential to publish the algorithm that the code is implementing.

So whilst physicists almost never published their actual codes, they *always* described the numerical problem that the code solves, so that someone else could code it up independently.

So in the four colour map problem, sure, it is nice to publish the actual code and it is nice for computer scientists to check it and claim that it is flawless. But it is far more important that the *algorithm* is described completely and that *independent* codes are tested against the original. Then I start to become reassured that the results are correct.

Let us assume that a computer code contains a subtle error. If you are merely checking the existing code, you are forced to try to understand how the code works and in doing so it is all too easy to fall into the same subtle traps of reasoning that the authors of the code fell into in the first place.

So Darrell Ince only has part of the picture here, and I think frankly that this reflects a computer scientist's viewpoint, in which the main objects of study are computers and software.

In contrast, a natural scientist who *uses* computers can (and must) test the behavior of code against scientific principles, known results and special cases. If these sorts of tests were done adequately, most of this issue would go away.
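To give a toy illustration (in R, though any language would do), here are the sorts of sanity checks I mean, against cases with known answers:

    # Check a numerical routine against an exact result:
    # the integral of sin(x) on [0, pi] is exactly 2.
    result <- integrate(sin, lower = 0, upper = pi)
    stopifnot(abs(result$value - 2) < 1e-8)

    # And a special case: an odd function over a symmetric
    # interval must integrate to (numerically) zero.
    result2 <- integrate(function(x) x^3, lower = -1, upper = 1)
    stopifnot(abs(result2$value) < 1e-8)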

High level computing languages such as Mathematica make it possible to quickly and easily construct independent codes. These codes, whilst much slower than Fortran, do allow direct "spot checking" of published results and I submit this would be much more productive than poring over reams of Fortran or C code looking for errors.

By Oz Observer (not verified) on 12 Feb 2010 #permalink

Revere, I understand that your post was not about climate change, and that you accept the science behind it, but it still buys into the false premise that data and code are not made available - and that doesn't appear to be the case.

Shouldn't we try to ascertain that there is a problem before trying to address the problem?

Oz Observer, 19:

I think you have a half-good point. I'm sure that for some fields, in which the required code is not too big, having everyone rewrite code is do-able. However:

It's ultimately as unreasonable to expect each researcher to rewrite all required code from scratch as it is to expect each researcher to make all the reagents, standards, substrates, machinery... where do we stop?

It is possible to learn to read code discriminatingly, and not be seduced by the fact that it seems to work. Note also that, if shared code becomes the standard, we know *why* two scientists get different computed results. With independent, private codes, we don't.

As a middle ground, one inherited from mathematicians of the scrappy Enlightenment, we could publish our test cases: sets of inputs and outputs, usually the analytically soluble bits of the algorithm but sometimes well-accepted observations, that define a code as correct. (Note that this would require us to *have* test cases.)
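In R, a published test suite might look something like this (the function trend() and the reference values are invented for illustration):

    # A table of published inputs and expected outputs:
    cases <- data.frame(
      input    = c(0, 1, 10),
      expected = c(0.0, 0.5, 4.7)  # hypothetical reference values
    )

    # Any reimplementation of trend() must reproduce them:
    for (i in seq_len(nrow(cases))) {
      got <- trend(cases$input[i])
      stopifnot(isTRUE(all.equal(got, cases$expected[i], tolerance = 1e-6)))
    }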

chlewis: I am in complete agreement. The point is not that we should rewrite code -- no one does that; we use recipes and blocks of code we know work -- but that the code not be a "secret sauce" whose ingredients can't be or aren't divulged. For that, the code has to be open and not proprietary. If you won't make your code available, it is the same as saying you won't make your reagents available or tell anyone how you made them. That's not acceptable. And I agree that benchmark data sets used for testing should also be made available. We do exactly that. This way new code can be compared to old code.