Where are the curators? Why did they miss this? What is the responsibility of the community anyway?

In a recent post, I wrote about an article that I read in Science magazine on the genetics of learning.

One of the things about the article that surprised me quite a bit was a mistake the authors made in placing the polymorphism in the wrong gene. I wrote about that yesterday. The other thing that surprised me was something that I found at the NCBI.

The article that I wrote about definitely made a mistake and I don't understand why it wasn't caught by the reviewers. I found it pretty quickly by searching OMIM and I was only trying to find information about dopamine, not verify results. Anyway, the authors of the paper assumed that the polymorphism they were studying mapped to the dopamine D2 receptor. They were wrong.

The disappointing and more surprising finding was the extent of wrong information that I found in the Gene database at the NCBI. I thought the Gene database was curated, not necessarily by the NCBI staff, but at least by the community.

Well, if the community is curating the information, they are doing a very poor job. I found about 31 papers in the GeneRIF section with what I think is the wrong information.

All of these papers concern an allele (the TaqI A allele) that many people thought mapped to the dopamine D2 receptor gene. According to a 2004 paper by Neville, Johnstone, and Walton, it does not.

Neville, Johnstone, and Walton claim that this polymorphism maps to a different gene, but there are still lots of other papers cited in the GeneRIFs that discuss the TaqI A allele of the dopamine D2 receptor. A quick look at the instructions shows that the problem bit of information, the GeneRIF, or Gene Reference Into Function, may have been entered by the authors themselves or at least someone else who read those papers.
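If you'd like to repeat the count yourself, something along these lines might work. It's only a rough sketch. I'm assuming a tab-separated dump of GeneRIFs with five columns (tax ID, Gene ID, a comma-separated PMID list, a timestamp, and the GeneRIF text), and I'm assuming that 1813 and 255239 are the Gene IDs for human DRD2 and ANKK1, so check both before trusting the output. The function name `find_generifs` is mine, and the sample rows are made up to show the format. They are not real GeneRIFs or real PMIDs.

```python
import csv
import io

# Made-up rows in the ASSUMED five-column, tab-separated layout:
# tax ID, Gene ID, PMID list, timestamp, GeneRIF text.
# None of these PMIDs or statements are real.
SAMPLE = (
    "#Tax ID\tGene ID\tPubMed ID (PMID) list\tlast update timestamp\tGeneRIF text\n"
    "9606\t1813\t11111111\t2008-01-01 00:00\tTaqI A allele of DRD2 and trait X (made-up row)\n"
    "9606\t1813\t22222222\t2008-01-01 00:00\tDRD2 expression in striatum (made-up row)\n"
    "9606\t255239\t33333333,44444444\t2008-01-01 00:00\tTaqIA polymorphism lies in ANKK1 (made-up row)\n"
)


def find_generifs(handle, gene_id, keywords):
    """Yield (pmids, text) for rows of one gene whose text mentions any keyword."""
    wanted = [k.lower() for k in keywords]
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, the header line, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < 5:
            continue
        _tax, gid, pmids, _stamp, text = row[:5]
        if gid != str(gene_id):
            continue
        # Case-insensitive substring match on the GeneRIF text.
        if any(k in text.lower() for k in wanted):
            yield pmids.split(","), text


# Gene ID 1813 is assumed to be human DRD2; verify before use.
hits = list(find_generifs(io.StringIO(SAMPLE), 1813, ["taqi", "taq1"]))
for pmids, text in hits:
    print(",".join(pmids), text)
```

To run it on a real file, you would open the dump and pass the file handle in place of the `io.StringIO` sample. A plain substring match will miss spelling variants and will also pick up statements that are actually correct, so the hits still need to be read by a person.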

I understand finding incorrect information in PubMed. We all know that science is a process and it's wise to be skeptical when you see the first publication on a topic. But, somehow I had the impression that the Gene database was curated and that the information in the Gene database had been verified. I can also see that the NCBI has a mechanism for the scientific community to correct the information that's been entered incorrectly.

Apparently no one has gone back and corrected this and now that I'm looking at the page for submitting corrections and realizing the extent of the problem, I'm not so sure that I want to spend the time on this either.

I don't even work in this area. Unlike the researchers who submitted these references and were paid for their work, and the journals that reviewed the work and were paid by subscribers, I'm just an innocent bystander. I can see that 31 citations will take a bit of time to correct, and I wish I hadn't even found this problem. Now I'm conflicted between a possible obligation to a larger community, who will neither value my contribution nor be pleased with my input, and the temptation to forget I ever saw the problem in the first place.

Damn.

UPDATE: here's a picture from the human DRD2 record in the Gene database showing some of the incorrect references. These are misleading because a naive reader would miss that the correct gene location was identified 4 years ago and would think that this polymorphism maps in the DRD2 gene when it doesn't.

[Image: grif_sm.gif, GeneRIFs from the human DRD2 record in the Gene database]


GenBank is not exactly curated. As far as I know, the NCBI staff make sure everything is in the proper format, but that's about it.

There are several papers which talk about the vast number of chimeric 16S DNA sequences which plague GenBank. A recent paper (2006) by Kevin Ashelford (doi:10.1128/AEM.00556-06) even demonstrated that some libraries entered into GenBank had chimeric rates of over 45%! That's almost half of the submitted sequences being PCR artifactual errors! Atrocious! Yet, there they sit in GenBank ... even after being spotted and reported on.

When I work with 16S data, I often BLAST against the GenBank database for the best hits. But I always run my data against the Ribosomal Database housed by Dr. Tiedje at MSU (https://rdp.cme.msu.edu/). They have a curated database, as well as providing access for online chimera detection.

Even the GenBank non-redundant (nr) library isn't exactly non-redundant anymore.

According to a 2005 Nucleic Acids Research paper (doi: 10.1093/nar/gki031) on the Gene database, the Gene database obtains part of its information from GenBank.

Quote: "Data in Entrez Gene result from a mixture of curation and automated analyses. Annotation in sequences from NCBI's Reference sequence project (2) or the International Nucleotide Sequence Database Collaboration (DDBJ, EMBL, GenBank) (3) is integrated with information from collaborating model organism databases, literature review (especially the Gene References into Function or GeneRIFs) (1), and public users, with curation by RefSeq staff as required."

Which would make me think that errors propagated in one can easily carry over into the other. Overall, I'm just not particularly impressed with the NCBI staff and how they handle their databases. I know with the advent of high-throughput DNA sequencing, they got swamped and caught unawares ... but they've never caught up and it doesn't look like they've made any significant steps to catch up and/or go back and revisit the errors that occurred in the lapse.

TomJoe - You missed the point. I wasn't complaining about the NCBI.

I think the reviewers from Science should have caught the original mistake and that the researchers who are working in this field should correct the GeneRIF citations.

Hello,

The polymorphism is indeed located in the ANKK1 gene, which is upstream of DRD2. But references 6-8 in Klein et al. (2007) are cited to show that the polymorphism seems to affect DRD2 expression levels, so it is fair to say that the polymorphism may be in a DRD2 regulatory region, or is tightly linked to one that is.

I haven't been able to check this, but it's possible that the polymorphism was discovered back in RFLP days, and ANKK1 might not have been identified yet. Since the polymorphism had an effect on DRD2 levels, it was called DRD2-TaqIA in the literature, and there are a lot of small association studies on DRD2-TaqIA and behavioral traits. As with much nomenclature in genetics, the old legacies are hard to change.

A similar case is the SNP associated with lactose intolerance (lactase persistence), which is thought to be located in an enhancer for the LCT gene (see Lewinsky et al. 2005), but whose position technically lies within the MCM6 gene. But everyone still refers to it as -13910 relative to LCT, rather than its position relative to MCM6.

So this all goes back to the debate on what exactly a "gene" is: just the open reading frame? The exons + introns? The cis-regulatory regions? Maybe GenBank needs a few more fields.

Best,
Andro Hsu

(Full disclosure: I work for 23andMe, which has entries on both associations mentioned here.)

How do we know they're wrong and not just practicing "denialism"?

I concur with Andro, and with Peter on the previous story. Your correction is valuable, but this isn't nearly as egregious as you're making it out to be.

In fact, I just received a weekly PubMed search result and see this...

J: That's funny -

I think the seriousness of the mis-mapping depends on whether or not the researchers who work in this field make assumptions about the effects of the TaqI polymorphism based on the idea that the mutation maps in DRD2.

My question here is: how do these errors get fixed, and who should do it?

Some time ago I worked on compiling a database of drug-target interactions (Matador, http://matador.embl.de). We relied on PubMed. If there was uncorrected information in PubMed, we will have picked it up (together with a link to the original article). And now this information will propagate through other databases, and it'll live on in the minds of researchers... I think it's almost futile to try to fix it: there are just too many places where you can't correct an error (that may or may not be significant).

Sandra,

You have a good point, but if convincing functional evidence connects a polymorphism in a neighboring gene with expression levels of a gene downstream, then I think the harm of conflating the two is minimized. The converse assumption--that a polymorphism mapping to ANKK1 would have no effect on DRD2--might actually have stymied research in this field.

Researchers identifying novel associations won't be able to rely just on the mapping--they'll have to look at the surrounding region/genes because SNPs may tag the actual causal variant in a number of neighboring genes spanning large LD blocks. So using a tool like the UCSC Genome Browser will become more important than relying on a single (possibly error-prone) curation.

I also have a feeling that nomenclature like DRD2-TaqIA starts to be used as a shorthand for what scientists "in the know" take for granted. Papers like this one make clear that the entire region has significant associations, although to get at mechanism, focus will probably return to DRD2. I know that this isn't helpful for someone (like me) who has to jump into many new sub-fields--the HLA region is particularly daunting for a non-immunologist.

Andro

Hello,
I work at NCBI on the RefSeq and Entrez Gene databases. Nomenclature issues are a known problem for database curators; it can be challenging to sort out 'which gene' when the same terminology has been used historically for different genes, or different sub-regions.

We have added a comment to both Gene records warning of the potential confusion. We are currently reviewing the literature associations. Do you feel that it is useful (for anyone) to find these GeneRIFs associated with the DRD2 gene? If simply moved to the ANKK1 gene, other scientists will also have a perception of erroneous associations because the publications themselves refer to the gene study focus as DRD2. An alternate approach is to associate the set of GeneRIFs in question with both loci.

By Kim Pruitt (not verified) on 24 Jun 2008

Hi Kim,

I like your ideas of associating the articles with both records AND as you mentioned, flagging the DRD2 entries to indicate that later information shows the polymorphism maps somewhere else.

Hi Andro: There are several similar cases of SNPs affecting neighboring genes, right? I did some searching through OMIM with some help from gawk/perl and OpenHelix and found the genes DRD2/ANKK1 and LCT/MCM6 that you mentioned, and also ESR1/C6orf97, OCA2/HERC2, and something in PMID 12176321 about a whole locus.

I'm searching for more, similar cases. I wonder how many there are. I cannot find a database which would connect the _real_ target gene with these SNPs for a comprehensive list. Does the 23andMe database show these associations? Is it possible to have a look at your database? I am also wondering how "closed" the 23andMe database is, as I guess it might be better curated than the NCBI databases...?
