Rare is the occasion when I disagree significantly with my collaborator Steve Novella, but this is one of those times. It's a measure of how much we agree on most things that, even in this case, I don't completely disagree with him. But, hey, it happens. I'm referring to Steve's post yesterday in which he gushed over the new policy at PLoS (Public Library of Science) regarding articles published in their journals. (Steve rarely gushes.) Here's the policy:
In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings. This Data Availability Statement will be published on the first page of each article.
You'll note that PLoS has significantly revised its announcement since Steve's post. In any case, Steve argued that this is a "fabulous idea," pointing out how such a policy could help with challenges such as "publication bias, the literature being flooded with preliminary or low quality research, researchers exploiting degrees of freedom (also referred to as “p-hacking”) without their questionable behavior being apparent in the final published paper, conflicts of interest, the relative lack of replications and lack of desire on the part of editors to publish replications, frequent statistical errors and the occasional deliberate fraud." He had a point, but a lively discussion broke out in the comments that, I think, surprised Steve. Not everyone, myself included, was quite as enthusiastic about this new policy as Steve was. This was likely due to a combination of factors, including the vagueness of the PLoS policy, concerns about protecting research subject confidentiality for human subjects research, and the impracticality of following the policy for some types of experiments. These problems were obvious to commenters who actually run labs and do research (like myself), less so to those who did not. Indeed, I noted a rough negative correlation between the level of enthusiasm for this policy and the amount of experience doing actual research.
I think I can suggest why this is by "cherry picking" a couple of problematic parts of the policy. For example, the policy defines data that must be shared thusly:
PLOS defines the “minimal dataset” to consist of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. Core descriptive data, methods, and study results should be included within the main paper, regardless of data deposition. PLOS does not accept references to “data not shown”. Authors who have datasets too large for sharing via repositories or uploaded files should contact the relevant journal for advice.
First off, let me note that I agree that, in these days of supplemental data sections posted online, "data not shown" is no longer acceptable in a research paper. There is pretty much no reason I can think of that any "data not shown" notation couldn't be changed to "see supplemental data section," with the data that formerly wasn't shown being deposited right there. The whole "data not shown" thing is a holdover from the days before all scientific papers were made available online as PDFs and full text documents, when space limitations required that the number of figures and the amount of text be limited. Data not considered essential to the findings reported in the scientific paper could be described as "data not shown." It's not as bad as it sounds. Usually, the data described as "not shown" was seen by the reviewers, as it was usually included with the manuscript. It just wasn't published.
But what is the "minimal dataset"? This is not a trivial question. I've published papers in which experiments have been done multiple times over several years, starting out with preliminary experiments repeated over and over again to work out the bugs and get the methods to work reproducibly, followed by the "real" experiments, the ones that ultimately end up in the final manuscript. Do I include all those messy, preliminary experiments? What about basic molecular biology studies? Is PLoS going to require, for instance, that the original, uncropped autoradiographs be included in the supplements? (Yes, we still do autoradiographs and use film in our labs to detect bands on gels through chemiluminescence.) Original lab notebook analyses, either copies or transcribed to print? Or the step-by-step analysis of data, which in some cases can be many, many pages long? Data transparency is great in concept, but when you start considering the nuts and bolts of what, exactly, data transparency means, it gets very, very messy very quickly. As was pointed out in the comments, the policy as currently written is so vague as to be almost completely unenforceable, which is why it’ll be really interesting to see what gets dumped in those supplemental data sections.
Unfortunately, PLoS's "clarification" does anything but clarify:
This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper. The ‘minimal dataset’ does not mean, for example, all data collected in the course of research, or all raw image files, or early iterations of a simulation or model before the final model was developed. We continue to request that the authors provide the “data underlying the findings described in their manuscript”. Precisely what form those data take will depend on the norms of the field and the requests of reviewers and editors, but the type and format of data being requested will continue to be the type and format PLOS has always required.
Ah, perhaps I should breathe more easily. We don't have to make our "early iterations" of a model available. On second thought: Define "early iteration." When does an iteration cease to be an "early" iteration? If the only data necessary are the direct data used in the paper, then why bother with this policy to begin with?
One issue that was brought up that probably isn't a huge consideration is that some datasets are too large to share easily. Genomics data, for instance, can easily end up taking up many terabytes. There also already exist public databases into which such data can be deposited, which would clearly satisfy the PLoS policy. However, other types of data lack such public databases. One reader described how the raw data from a single color channel of a single image of super-resolution microscopy takes up 8 GB, meaning that each standard image takes up 20-24 GB. If each experiment involves taking photos of large numbers of cells, say 30 or so, then one experiment containing only the negative control and the test condition can easily reach 1 TB of data. One imaging data set was described as taking up 10 TB, which cost over $1,000 to store on a RAID. In that case, will a statement that the researcher will share the original data suffice?
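For anyone who wants to check the back-of-the-envelope math on the microscopy example, here is a rough sketch. It assumes that "30 or so" means 30 cells imaged per condition, that there are exactly two conditions (negative control and test), and that 1 TB is 1,000 GB. The function name is made up purely for illustration.

```python
# Rough storage estimate for the super-resolution microscopy example.
# Assumptions (not stated in the reader's description): 30 cells are imaged
# per condition, there are two conditions, and 1 TB = 1000 GB.

def experiment_size_gb(gb_per_image, cells_per_condition, conditions):
    """Total raw image data for one experiment, in GB."""
    return gb_per_image * cells_per_condition * conditions

low_gb = experiment_size_gb(20, 30, 2)   # low end: 20 GB per image
high_gb = experiment_size_gb(24, 30, 2)  # high end: 24 GB per image

print(f"{low_gb / 1000:.2f} to {high_gb / 1000:.2f} TB per experiment")
```

Under those assumptions, a single two-condition experiment lands somewhat above the 1 TB mark, which is in line with the reader's figure. If the 30 cells are spread across both conditions instead, the total is roughly half that.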
My guess is that fewer investigators are going to want to submit their work to PLoS journals. Indeed, I've just been through the process of submitting two manuscripts to a PLoS journal, PLoS One. I just had one paper published by PLoS One, with another one in the can to be published next month. It was a big enough pain in the rear to submit to PLoS to begin with, not even counting the $1,300 or so per manuscript in page charges due to the journal being open access. If I wasn’t sure I would be doing it again before this announcement, now I really don’t know if I will do it again, given the extra time it will take to make sure the data are available to the satisfaction of PLoS. I already am fine about providing raw data to an investigator who requests it.
Another issue I noticed was this:
For studies involving human participants, data must be handled so as to not compromise study participants’ privacy. PLOS recommends that researchers follow established guidance and applicable local laws in ensuring they do not compromise participant privacy. Resources which researchers may consult for guidance include:
Steps necessary to protect privacy may include de-identification, blocking portions of the database, or license agreements directed specifically at privacy concerns. Authors should indicate, as part of the ethics statement, the ways in which the study participants’ privacy was preserved. If license agreements apply, authors should note the process necessary for other researchers to obtain a license.
This policy is a bit naive. De-identifying the data would not be guaranteed to adequately protect the identities of clinical trial subjects, at least among hospital staff and others who might deal with them or friends, family, or acquaintances who might put together measurements and dates to figure out which subject is who. While that might seem harmless, it would nonetheless be a violation of HIPAA privacy regulations, which do not allow exceptions for curious family members or hospital staff. Yes, the chances of this happening are low, but when data are available to anyone (i.e., are public) the chances of this happening can't be ignored. And, yes, there are examples of successful anonymization of data for sharing data sets, but, as is noted here, it is "time consuming and therefore costly." It would require that clinical trials be designed from their very inception with data sharing in mind, and that the informed consent that patients sign mention that the data will be shared. This has the potential to be a good thing in principle, but again the devil is in the details. The problem is also financial. Funding sources already barely provide enough funding to do this research, and often they do not, at least not completely. In the absence of increased funding to do this, it's a burden on researchers.
All of which is probably why PLoS backtracked:
Like some other types of data, it is often not ethical or legal to share patient data universally, so we provide guidance on the routes available to authors of such data, and we encourage anyone with concerns of this type to contact the journal they would like to submit to, or the data team at email@example.com.
But the original policy formulated doesn't give me a great deal of confidence that PLoS knows what it's doing with respect to clinical trials confidentiality.
I don't know if I completely agree with the ever-irascible Drug Monkey (one of the only researchers I've encountered whose tendency towards Insolence approaches my own), when he referred to the new PLoS policy as "letting the inmates run the asylum" and "whackaloonery," but he does make some good points. He prefaces his complaints by discussing how he thinks PLoS, through its policies on animals, basically tries to sidestep the local IACUC (the ethics committee that approves animal research), and that complaint rings somewhat true. I remember that the statements about animal research that PLoS made me sign did make me wonder whether approval of my animal experiments by my university's IACUC was going to be adequate. He also makes a legitimate point about "self-plagiarism" with respect to methods sections. Personally, like many scientists I, too, recycle huge swaths of my methods sections, because the methods for each technique are the same. Only the reagents, DNA constructs, and specific drugs and doses vary. The overall assays and techniques tend to be the same or very similar. It just doesn't make sense to have to rewrite them every time.
Those complaints, however, have nothing to do with the current question about data "openness." DrugMonkey's first complaint is this:
The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time, to address the legitimate sins of the few. The scope of the problem hasn’t even been proven to be significant and we are ALL supposed to devote a lot more of our precious personnel time to data curation. Need I mention that research funds are tight and that personnel time is the most significant cost?
I know tight funding. There are only two people in my lab now, me and my lab manager.
The one that resonates with me is this one, which I came up with independently, albeit with a different emphasis, as you might imagine given the usual blogging topics I take on:
Fourth problem- I grasp that actual fraud and misleading presentation of data happens. But I also recognize, as the waccaloons do not, that there is a LOT of legitimate difference of opinion on data handling, even within a very old and well established methodological tradition. I also see a lot of will on the part of science denialists to pretend that science is something it cannot be in their nitpicking of the data. There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself in the role of how experiments are to be conducted and interpreted! That’s fine for post-publication review but to use that as a gatekeeper before publication? Really PLoS ONE? Do you see how this is exactly like preventing publication because two of your three reviewers argue that it is not impactful enough?
DM is referring to the mission of a PLoS journal, PLoS One, which is to be different from other journals in that it will publish any well-conducted science without any assessment of whether the results are "important" or not; in other words, to be a repository for all science. I thought of DM's concern from a different angle, one that I'd think of because of my usual blogging topics. (Actually, I’m not sure they thought through the consequences of this all that well.) The cranks, quacks, and antivaccinationists will have a field day with this. They already do their damnedest to get the original datasets for various studies they don’t like, the better to “analyze” them the way they want or to find flaws in them. You could argue that knowing that anyone, including cranks, can see their original data will motivate scientists to hold their publications to a higher standard. Maybe so in some cases. In fact, probably so in some cases. However, what’s more likely to happen in most cases is that scientists in controversial fields frequently attacked by cranks just won’t publish in PLoS journals anymore because, however rigorous their analyses are, they’ll have to put up with the hassle of cranks “re-analyzing” their data to discredit them. Most scientists care far more about what other scientists think about them than what cranks do, but on the other hand it’s understandable not to want the hassle of dealing with, say, antivaccinationists. I know I wouldn't if I did vaccine research.
A willingness to share data is, without a doubt, one of the highest ideals of science. However, it isn't as simple as a journal (or journals) mandating it. One can argue that one of the bigger flaws in science as it is practiced now is that it lacks the infrastructure and agreed-upon methodologies to make sharing easy and expected. To the extent that PLoS has started the conversation on how to work towards this goal, I'm with its editors. However, right now the effort strikes me as half-baked. I want to get behind it, but right now I feel that PLoS is using a blunt instrument that doesn't appear to have been well thought out.
Sorry, Steve. We can't always agree, at least not completely.
First off, let me note that I agree that, in these days of supplemental data sections posted online, “data not shown” is no longer acceptable in a research paper.
The industry apologists at the "Scientific Kitchen," who naturally fancy themselves in the role of gatekeepers, disagree.
Of course, Anderson here also informs readers that Twitter "drives creativity."
One question that does arise in this context is whether supplementary material is supposed to have the journal's seal of approval; I came across a case recently in which it was expressly disclaimed. On the other hand, the old gig did endorse them and canonically formatted tables. Unfortunately, they weren't sanity-checked, and there was no automation in place to propagate even minor changes in the main paper -- say, the title -- to them. Needless to say, this was a headache when I had to go through an issue's worth, on top of catching the content errors (missing/out-of-range values, etc.) in the ones that I "owned."
^ "Scientific Scholarly Kitchen"
As long as scientists will be evaluated by counting papers, there will be a need to check that they don't cheat. But, in addition to false data, many problems arise from wrong conclusions motivated by the pressure to publish. It is time to say that evaluating scientists by H-index causes more harm than good.
It seems like perhaps a smarter way for PLoS to approach this issue would have been to publicly define a standard of data-sharing it wished to encourage, and then having established the term in the discourse, encourage its authors to either comply with that standard or explain why it would not be feasible in their specific case.
If it successfully becomes a new norm, it's easier from there to argue that it should be the new norm. If it doesn't, well, then, good thing we didn't try to coerce everyone into it, eh?
I'm going to have to side with Dr. Novella. Data transparency is essential, and here's why: The failure of peer review.
It is unreasonable to expect a small panel of expert peer reviewers to do the job that the scientific community should do in vetting research and debating controversial findings from an informed position.
Let's take two papers as examples:
The Seralini GMO paper. Food and Chemical Toxicology 50 (11): 4221–31. There were so many unanswered questions about methodology, results, statistical testing. Somehow it passed peer review and entered the literature.
Wakefield's 1998 paper. Lancet. 1998 Feb 28;351(9103):637-41. This was the result of fraud, obviously, but that's because of a lack of transparency. Having looked at the raw data, it was obvious to anyone familiar with real-time PCR that the data was a deception. He concealed this intentional act of fraud in the paper, but the raw data shows the clear evidence (thresholds were adjusted between reference standards and samples). It took "strike-off" proceedings to bring that out.
Publication standards like MIQE (Minimum Information about Quantitative PCR Experiments) are a good start, because they force authors to be explicit in the process they went through. What is then needed is the exposure of this data transparency to the larger scientific community. In particular, knowing that statisticians can access my data will make me more likely to consult a local stats person in defense.
I know it will be cumbersome and inconvenient, and the concerns about HIPAA are very, very valid... but with the proliferation of journals, it may be the only way to stem the tide of bad papers. As an added bonus, it may make other, more banal types of fraud detectable: data duplication, for example, would be easier to detect.
That's just my opinion... I don't have to write manuscripts, though, so I suppose I don't have much skin in the game. I also hadn't considered the "chilling effect" that all this data curation would have on publication in certain journals. I rely heavily on open journals, so I hope that whatever happens doesn't interfere with the shift towards papers outside of paywalls.
I have no problem with publication standards like MIQE. I don't have problems with requirements that, for instance, RNAseq data be deposited in one of the public databases for it. I don't even necessarily have a problem with a PLoS-like requirement if it were not so onerous or if it weren't in essence an unfunded mandate and so vaguely formulated.
This kind of brouhaha makes me glad my work doesn't involve anything that's actually alive.
This policy assures that no industry conducted research is published in PLOS. It's hard enough to get enough data cleared for publication to make up a full paper (as compared to the unusual communications or conference papers). If all underlying research data has to be added it will be impossible to sign that none of the research will be relevant to some patent 5 years down the road, leading to a strong "njet" from legal.
Not being in industry, I hadn't even thought of that, but you're right. Of course, I rather suspect that PLoS isn't interested in having industry publish in its journals. On the other hand, given that industry is usually the biggest offender when it comes to lack of data transparency, that's one huge area where this policy will have zero effect while putting a big burden on researchers who try to play by the rules.
In microarray work we've been coughing up the data for about a decade without the sky falling. I submit it has been good for science and the patients, sharing data, improving honesty, and highlighting best practices. Are there details to be worked out and practicality to consider - yes. (Feldspar's suggestion is about right.) Are people who do shoddy work or publish as many papers as possible on the same data going to oppose divulging data - yes.
" I noted a rough negative correlation between the level of enthusiasm for this policy and the amount of experience doing actual research." Let's see your data and analysis methods.
“I noted a rough negative correlation between the level of enthusiasm for this policy and the amount of experience doing actual research.” Let’s see your data and analysis methods.
Don't quit your day job. You won't hack it as a comedian.
It's the way of the future, and we're going to have to get over the technical limitations of this because more and more journals will ask for it. I was at a talk last week by one of the associate editors of BMJ, and he mentioned their move toward more data sharing and more openness of data. The limitations were brought up, but he said that they were just going to have to be overcome. And they probably will.
I'm not going into research with this doctorate work, but I am going into implementing that research. It's going to be key for me to be able to take the tools I've learned in Epi Methods (I, II, and III), Professional Epi Methods, and Biostats (I, II, III, and IV: The Undiscovered Country), and read the literature properly. Will that include me wanting to look at and examine the data? Probably not, but it may happen especially if the results are groundbreaking.
For 99.9% of the public, this is just another one of those "in an ideal world" things that sounds like a good idea. For those involved in research, this is a good idea that has a lot of caveats. For me, the professional epidemiologist, it's just an interesting development.
Know what I mean?
From a research ethics perspective, I have a hard time seeing any IRB approve research where subject data, even if deidentified, would be open to anyone. Suppose, though, that this became the norm. I'm willing to bet quite a bit that insurance companies would love to pore over the data. Or what about addiction research? Every citizen has the right to not incriminate themselves. Currently, a federally issued Certificate of Confidentiality allows the researchers to deny disclosure of subject information to law enforcement. This policy could be used as an end-run around such protections.
From a human subjects research protections point of view, this policy stinks.
"From a human subjects research protections point of view, this policy stinks."
I'll have to bring it up in my human research ethics class... In five minutes.
There also is the matter of who owns the data. I have been following the Global Warming debate where a few climate denialists have been harassing scientists for years. One of the more persistent ones, probably better financed by various industrial groups, was demanding data from, IIRC, the Hadley Centre (UK) and loudly alleging fraud, etc.
He seemed to ignore the fact that most of the data was and had been publicly available for years--these are public data sets and anybody can get them. Other data sets were proprietary sets from other national governments and Hadley Centre was contractually required to maintain confidentiality. They legally could not release them.
In the end, as always, the job of enforcing this standard will fall to the reviewers, as the editors will often lack the subject matter expertise to know what is missing from the submitted data. Although I'm in favor of pushing towards full disclosure of data, I would add that it matters a great deal how the data is shared. I have often been interested in exploring the underlying data of a paper only to discover that it is available only as a 120 page PDF. Cue the cursing and fruitless attempts to extract said data from the PDF and into something that can be imported into R. Likewise, it is ridiculous that RNA-Seq experiments are supplied as raw data in GEO, requiring anyone who wants to use the data to perform their own assembly of reads into transcript counts.
Hey, if they want raw data, then raw data is what they get.
One of the journals I am familiar with has recently announced a comparable policy on data openness. In their case, they include things like source code for any programs you wrote yourself to use the data. They do allow workarounds like putting it on your institutional website or "data available upon request". They also have considered the proprietary data angle (in that case you have to specify the data provider).
It's a significant change from how we've previously done business in my field. I think it can be done, but I don't work with human or animal subjects, and I'm not in industry or a DoD lab. Aligning a policy like this with things like HIPAA or sensitive information policies that the private sector or DoD would enforce is going to be a major headache.
So in addition to basically throwing out people from industry (as Mu pointed out above), this is also going to exclude people in DoD labs, and most people with DoD funding. That may not be a big deal in most biomedical fields (although I'm sure they do fund stuff with potential military applications). But it is a major problem for anybody at places like USAMRIID or Los Alamos.
Likewise, it is ridiculous that RNA-Seq experiments are supplied as raw data in GEO, requiring anyone who wants to use the data to perform their own assembly of reads into transcript counts.
This is in no way ridiculous. There are quite a few different RNAseq assembly and analysis methods out there, and they vary in their strengths and weaknesses.
In my opinion, anyone who doesn't go and map the raw data themselves likely has little appreciation for how much this process can affect your analysis, and will end up drawing conclusions that don't take these nuances into consideration.
I have often been interested in exploring the underlying data of a paper only to discover that it is available only as a 120 page PDF.
Be careful what you wish for. You just might get it.
this is also going to exclude people in DoD labs, and most people with DoD funding. That may not be a big deal in most biomedical fields (although I’m sure they do fund stuff with potential military applications)
My lab actually does work on a DoD funded grant involving military samples, and we're grappling with this very problem right now.
I struggle with the open data issue on a daily basis. I think the benefits of data access are obvious to most. But modern genomic data is WAY more identifiable than most people think. This problem is only going to get worse as genome sequencing continues to enter medical practice. Personally, I think the best solution for clinical genomic data is a multi-level controlled access model where summary data is publicly available to all, and raw data is available through dbGaP to those who demonstrate a genuine need or desire to use the data.
Sorry, I'm not going to put digitized interviews into a repository where the anonymity can't be guaranteed.
Wow. I've been sitting on revisions for a manuscript I intended to submit to PLoS ONE, and now that I'll need to submit about 50 CDs of micrograph images and scan several lab notebooks, this is definitely not a task I can fit in while trying to run a business.
Now I really feel horrible about not finishing this up for my PI's benefit. (Somehow I don't think publications are relevant to my new career in laser-cut artwork.)
If the aim of science is industrial production of data, then it is no surprise that there is some quality control on what is produced. The problem is that there is no more control on theories, as they are not clearly formulated, are trying to fit the data of individual papers, and follow fashion independently of whether they are falsified or verified.
You've already posted this thoroughly uninteresting link. Michael Spivak's "Publish or Perish" press has been around for decades. It's not a new concept. And ResearchGate is going nowhere.
I hope you're going somewhere. I don't follow you.
Daniel, the scientific method requires data collection, followed by rigorous analysis. Otherwise, it's not science but philosophy or discussion or mental masturbation or Twitter (in descending order of mentation). The more complicated the science, the more data must be collected. You can sneeringly call it "industrial production of data", but that does not change the need. Theories or rather "hypotheses" (to those of us who work in scientific fields, thank you) are not controlled by some board of super-intellects on the basis of purity; they are formulated, discussed, analyzed, and retained or discarded if they do/do not fit the observations. Fashion does not come into it. Your quips about "control" and "fashion" have solidified for me why I have come to dislike reading your comments: they denigrate the ethics of scientists in a backhanded manner, without forthrightly and honestly making open statements that you will have to defend. I think you are willfully rude and offensive in a furtive manner.
Have you really read what I wrote? Do I say that data are not necessary? I disagree with your conclusion that the more complicated the science, the more data must be collected. What I question is the quantitative, bureaucratic, evaluation of science. For the rest, if I look offensive, this is certainly in the furtive way ;-)
"I’m not going into research with this doctorate work, but I am going into implementing that research. It’s going to be key for me to be able to take the tools I’ve learned in Epi Methods (I, II, and III), Professional Epi Methods, and Biostats (I, II, III, and IV: The Undiscovered Country), and read the literature properly. Will that include me wanting to look at and examine the data? Probably not, but it may happen especially if the results are groundbreaking.
For 99.9% of the public, this is just another one of those “in an ideal world” things that sounds like a good idea. For those involved in research, this is a good idea that has a lot of caveats. For me, the professional epidemiologist, it’s just an interesting development.
Know what I mean?"
I think the main thing one would want to do, as a professional epidemiologist, before trying to implement any research, is make sure that the data actually supports the stated conclusions. To do that, one must examine the data. It's part of reading the literature properly.
Routine failure to do that is part of why the medical literature is so prone to irreproducibility. Even the peer reviewers don't seem to be making sure that the data really warrants the conclusions.
Where did you get the idea that vetting the data is only desirable if the results are "groundbreaking?" Results that support the status quo may be even more suspect, because confirmation bias is likely to have influenced the authors' interpretation of their results.
I find that there are many published papers in the medical literature where the provided data doesn't actually justify the stated conclusions. In those cases, providing yet more data wouldn't make any difference. That doesn't mean that better data transparency isn't a good idea - just that it's probably not the biggest problem.
I hope you’re going somewhere. I don’t follow you.
[Non sequitur self-citation omitted]
Why did you feel it necessary to twice link to an unrelated set of (barely readable,* given that half of the window real estate is taken up by a content-free header and a "sign up" footer) comments aside from your appearing at the top of the list?
You appear to simply be insisting that the discussion should not be about the actual topic, i.e., a demand for data sharing that wasn't well thought out, and the remark that you pointed to doesn't even say anything meaningful about bibliometrics in the first place.
* Nice design touch:
A script on this page may be busy, or it may have stopped responding. You can stop the script now, or you can continue to see if the script will complete.
^ I should have prevented the autolinking, sorry.
The AVN over here in Aus used to always spruik that they had looked at the 'raw data' of studies and had come to different conclusions.
We at SAVN think they either meant the 'raw' VAERS data, or were spouting bulldust.
Having had to go through actual raw data once when investigating how a research protocol had been undertaken at one of our hospitals as part of a university study all I can say is yikes. It's going to be interesting to say the least to see how the raw data is kept, let alone submitted, for some.
This kind of brouhaha makes me glad my work doesn’t involve anything that’s actually alive.
In a sense, it does. One of my lab-rat tasks as an undergrad was having to go through a Pioneer charged-particle instrument telemetry report every week to look for sensor flakiness. It was delivered from Ames as a two-inch-thick fanfold printout (presumably with a tape somewhere, some of it in BCD).
What does "all data underlying the findings described" even mean in this context? These are nontrivial reduction pipelines. Sure, everything generally is available nowadays, but the implication is something like "here are the Level 0 data (which we ourselves were not foolish enough to bother with) and some references to the instrument technical reports. Have a nice day."
Narad @33 -- That's quite a story!
I suppose the LHC is the ultimate garbage-data machine -- they automatically discard most of the data because they can tell on the fly that it's not interesting, and it is physically impossible to store the data fast enough with any available technology.
Given that, do you suppose anyone wants to second-guess the Higgs boson?
That’s quite a story!
Don't even get me started on patching the assembly language portions of the dE/dz code. For the 24 bit VOS platform. In retrospect, it was really a missed opportunity and part of the reason I moved away from the field – nobody bothered to explain why these nuts and bolts were novel, much less important (which they were; there were advances in position-sensing counters being made).
David Rabinowitz, of later Sedna fame, was a very cool and very supportive guy to have around, despite the fact that I wasn't assigned to him.
Narad @35 --
From the title I thought this might be a post about the ongoing mess that is "care.data" in the UK. Orac, I wonder if you've been following it?