One of the things that drives me crazy on occasion is nomenclature. Well, maybe not just nomenclature, it's really the continual changes in the nomenclature, and the time it takes for those changes to ripple through various databases and get reconciled with other kinds of information. And the realization that sometimes this reconciliation may never happen.
One of the projects that I've been working on during the past couple of years has involved developing educational materials that use bioinformatics tools to look at the isozymes that metabolize alcohol. As part of this project, I've been collecting three-dimensional structures of the enzymes and annotating polymorphic amino acids. That part is very straightforward. The complicated part of the project is figuring out how the structures correspond to the genes, genetic data, and association studies with diverse polymorphisms.
In retrospect, having sorted through all these things, it seems like compiling that information should be straightforward, too. But in practice, as someone who put all the information together by reading accessible papers and searching databases, it's not. I ran up against several confusing moments where I ended up banging my head against the wall trying to sort out which polymorphism correlates with which structural change, what it was called 8 years ago, what it's called now, and how the changes are tied to the structures.
The human versions of the alcohol and aldehyde dehydrogenase genes stand out as an example of a gene family whose members have all had multiple names and confusing polymorphisms. At one time, it seems that there were seven human genes, named ADH1-7. About eight years ago, the names all changed. The genes formerly called ADH1, ADH2, ADH3, and ADH4-7 became ADH1A, ADH1B, ADH1C, and ADH4-7. The isozymes that were named alpha, beta, gamma, chi, mu, sigma, and phi changed names, too. Even though I could find some of this information in the Entrez Gene database, it still took work to figure out how those names were tied to the genes now, especially when the 3D structures have names like Human Alcohol Dehydrogenase Beta-1-Beta-1 Isoform or ADH chi chi. Naturally, the structure database entries don't tell you anything about the most recent name of the gene.
Is beta 1 the same thing as the beta polypeptide? Does this mean there's a beta 2?
Even more confusing, many of the polymorphisms have names like ADH2*1, ADH2*2, ADH2*3. That would be fine, but these names aren't in dbSNP. There's nothing in dbSNP that directly ties the SNPs to these polymorphisms. Nor are these names used consistently in the literature. It seems like every paper that tries to find an association between a phenotype and a genotype uses a different name for the genetic variations they're genotyping.
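As a rough illustration of the renaming described above, here is a minimal Python sketch that rewrites an old-style gene or allele name using the current gene symbol. The mapping is the one listed earlier (ADH1, ADH2, and ADH3 became ADH1A, ADH1B, and ADH1C; ADH4-7 kept their names). The function name `modernize` is made up for this example, and carrying the allele suffix over unchanged (so ADH2*2 becomes ADH1B*2) is an assumption of the sketch, not something dbSNP will confirm for you.

```python
# Old-to-current gene symbols, as listed in the post.
# ADH4 through ADH7 kept their names, so they are not in the table.
OLD_TO_CURRENT = {
    "ADH1": "ADH1A",
    "ADH2": "ADH1B",
    "ADH3": "ADH1C",
}


def modernize(name: str) -> str:
    """Rewrite an old gene or star-allele name using the current gene symbol.

    The allele suffix after '*' is carried over unchanged, which is an
    assumption made for this sketch.
    """
    gene, star, allele = name.strip().partition("*")
    gene = OLD_TO_CURRENT.get(gene.upper(), gene.upper())
    return gene + star + allele


for old in ["ADH2", "ADH2*2", "ADH3*1", "ADH4"]:
    print(old, "->", modernize(old))
```

A lookup like this cannot tell whether a name it is given is old or already current, which is exactly the problem with undated tables and papers.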
Even worse, some of the places where you might expect to find current information are out of date. I read through some of the publications at the National Institute on Alcohol Abuse and Alcoholism site including the strategic plan, naively thinking that the current names might be there.
There was a nice table based on papers from 1998 and 2002, but the table had all old names for the polymorphisms and none of the current names or SNP references. It was also missing all the odd Greek subunit names that were assigned to various isozymes like beta, alpha, sigma, and chi.
At least this table had a date, though. Part of what made it hard to put the new and old information together was not knowing which information was current and which was not. Making a spreadsheet to keep track didn't get easier until I found a paper with a kind of ADH Rosetta stone saying that the names changed in 2001 and how.
OMIM was the best database for figuring this stuff out, once I knew the current names of all the ADH genes. Unlike some of the other databases, I was able to look at OMIM and see when the last updates had happened. In this case, sadly, the last updates to the ADH genes were made in October 2007 by the late Victor McKusick. I knew I could trust those.
OMIM was still confusing though, since the entries would often use multiple names, in successive paragraphs, to refer to the same SNP, without giving any warning to the reader. In the ADH1B reference for example, one paragraph calls a polymorphism the typical and atypical forms of ADH, ADH2*1 and ADH2*2, the next paragraph calls it ADH1B*47his, and the next paragraph calls it ADH1B arg47-to-his polymorphism (rs1229984).
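One way to cope with an entry that switches names from paragraph to paragraph is to keep a small alias index keyed on the one stable identifier, the rs number. The sketch below is only an illustration: the aliases are the ones quoted from the ADH1B entry above, while the `normalize` and `resolve` helpers are hypothetical names invented here.

```python
import re

# Names the OMIM ADH1B entry uses for one polymorphic site, as quoted in
# the post. The star names label alleles at the site; the rs number labels
# the site itself.
ALIASES = {
    "rs1229984": [
        "ADH2*1",
        "ADH2*2",
        "ADH1B*47his",
        "ADH1B arg47-to-his",
    ],
}


def normalize(name: str) -> str:
    """Lowercase a name and drop spaces and hyphens so spelling variants match."""
    return re.sub(r"[\s\-]+", "", name.lower())


# Reverse index: normalized alias -> rs identifier.
INDEX = {
    normalize(alias): rs_id
    for rs_id, names in ALIASES.items()
    for alias in names + [rs_id]
}


def resolve(name: str):
    """Return the rs identifier for a known alias, or None if it is unknown."""
    return INDEX.get(normalize(name))


print(resolve("ADH1B*47His"))
print(resolve("adh1b Arg47 to His"))
print(resolve("ADH2*3"))
```

The last lookup comes back empty because ADH2*3 is not among the names quoted for this site; somebody still has to read the papers to fill in the table.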
It took some time searching several databases, and filling in tables, but I finally did sort out the connections between the old names of the isozymes, SNPs, and genes, and the new names. I even tried using some other tools, like NextBio, for doing literature searches only to have them tell me that ADH is vasopressin and not give me any information on the genes I wanted.
What can I conclude from this activity?
There's still a place in the world for people (like me) who actually read papers instead of simply trusting database records.
The question that I'm left with is how to articulate what I did and how to describe the most efficient path so that I can teach this sort of thing to students. I battled through the databases and conflicting names and sorted them all out because I'm motivated and don't mind reading the papers. I worry that most undergraduates will just get disgusted and give up.
Some people will say that the research I did in sorting out the genes, proteins, and mutations, doesn't belong in a bioinformatics course. They'll say it's not really bioinformatics at all even though it involves trying to reconcile biological information from multiple databases.
But if it's not bioinformatics, what is it? Reading? Annotating?
Others will say that sorting out this kind of stuff is a trivial problem. All the information was in the database, right?
Well, if this kind of work is trivial, why would it take someone with a Ph.D. and several years of experience three days to figure this out?
Others will say this kind of work isn't something we should bother trying to teach at all. We just need better search tools, right? Won't the semantic web solve all of this?
No, I don't think reconciling database records and nomenclature is a solved problem.
As far as bioinformatics goes, I think combining the information from genes, proteins, polymorphisms, structures, and genetics is the hardest thing to do and the absolutely hardest thing to teach.
If it were easy, there would be a database that did it.
Well, at least you had only to comb through the gene names for one organism. The fun starts when you have to keep track of homologs in other species. It is really a chore, and I am sure lots of interesting connections get missed because the same thing is called different names in different publications and databases.
Just as an example, last week a high throughput two hybrid study for C.e. proteins was published. Since they had a search engine, I thought I might take a look and see if there was something about my favorite complex (from H.s.). It took me 2 hours just to decide which name I had to input in the search box for each of my proteins. And, since I have not found anything, I still don't know whether there is nothing about my proteins or whether I simply looked with the wrong names.
Why can't we have a 'synonym' table database that keeps track of these things? It is probably a difficult thing to set up, but hardly an unneeded one...
Normally I search under Gene in Entrez; there are usually a lot of synonyms listed under each gene. Then there is an easy link to the RefSeq mRNA and protein entries. I'd like to hear if anyone has an easier solution.
Andrea - you might try the Homologene database. It does pretty well at listing orthologues.
Dave: You're right, you can get the synonyms for the gene names from Entrez Gene, but not always for the proteins. What makes it confusing is that it's not easy to tell which synonyms were used when and which are current. For example, the entry for ADH6 lists ADH-5 as a synonym. That's confusing, since ADH5 and ADH6 are different genes. It turns out that ADH-5 was a synonym in 2001, but it's not one now. We need some kind of information about versioning history.
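To make the idea of versioning history concrete, here is a minimal sketch of a synonym record that carries the years in which a name was in use. Everything about the layout is an assumption for illustration: the `Synonym` class, its field names, and the `genes_for` helper are invented here, and the exact start and end years are placeholders, since all I know is that ADH-5 was a synonym in 2001 and is not one now.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Synonym:
    name: str                  # the alternative name
    gene: str                  # current official symbol it pointed to
    first_year: int            # first year the synonym was in use
    last_year: Optional[int]   # last year in use; None means still current


# Illustrative record only. The start and end years are placeholders.
SYNONYMS = [
    Synonym("ADH-5", "ADH6", first_year=2001, last_year=2001),
]


def genes_for(name: str, year: int) -> List[str]:
    """Return the genes a synonym referred to in the given year."""
    return [
        s.gene
        for s in SYNONYMS
        if s.name == name
        and s.first_year <= year
        and (s.last_year is None or year <= s.last_year)
    ]


print(genes_for("ADH-5", 2001))
print(genes_for("ADH-5", 2008))
```

With dates attached, a reader looking at a 2001 paper and a reader looking at a current database record would each get an answer that fits their source.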
I will take a look at Homologene, thanks.
@Andrea M: I've set up databases to track synonyms and homologs for various companies I've worked for. It is not hard to set them up, and the DB structure can be pretty straightforward, but it is time consuming to populate them with data and really hard to maintain the momentum to keep them up to date. Some of the recent experiments with wikis for proteins or genes may help, by giving the community a really easy way to help keep the data up to date, but it is too early to say whether these initiatives will succeed.
Cloud knows what she's talking about :). That said, it's kind of irritating that there are no resources that can be used as a service to query against. Even for a wiki project to be successful, you need an API for programmatic access.
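For anyone wondering what a straightforward structure might look like, here is one possible sketch using Python's built-in sqlite3 module. It is not the schema from any of those company databases; the table and column names are assumptions, and HsGeneX, CeGeneX, and GX-1 are made-up placeholder names rather than real genes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id  INTEGER PRIMARY KEY,
    symbol   TEXT NOT NULL,
    species  TEXT NOT NULL,
    UNIQUE (symbol, species)
);
CREATE TABLE synonym (
    gene_id  INTEGER NOT NULL REFERENCES gene(gene_id),
    name     TEXT NOT NULL,
    source   TEXT             -- where the name was seen (paper, database)
);
CREATE TABLE homolog (
    gene_id_a INTEGER NOT NULL REFERENCES gene(gene_id),
    gene_id_b INTEGER NOT NULL REFERENCES gene(gene_id)
);
""")

# Placeholder rows; HsGeneX, CeGeneX, and GX-1 are made-up names.
conn.executemany(
    "INSERT INTO gene (gene_id, symbol, species) VALUES (?, ?, ?)",
    [(1, "HsGeneX", "Homo sapiens"), (2, "CeGeneX", "Caenorhabditis elegans")],
)
conn.execute("INSERT INTO synonym VALUES (1, 'GX-1', 'made-up paper')")
conn.execute("INSERT INTO homolog VALUES (1, 2)")


def homologs_of(name: str, species: str):
    """Find homologs in `species` for a gene given by symbol or synonym."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.symbol
        FROM gene AS g
        LEFT JOIN synonym AS s ON s.gene_id = g.gene_id
        JOIN homolog AS h ON g.gene_id IN (h.gene_id_a, h.gene_id_b)
        JOIN gene AS other
          ON other.gene_id IN (h.gene_id_a, h.gene_id_b)
         AND other.gene_id != g.gene_id
        WHERE (g.symbol = ? OR s.name = ?) AND other.species = ?
        ORDER BY other.symbol
        """,
        (name, name, species),
    ).fetchall()
    return [row[0] for row in rows]


print(homologs_of("GX-1", "Caenorhabditis elegans"))
```

The three tables are the easy part. The rows are where the time goes, because every synonym and homolog pair has to be found, checked, and kept current by someone.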
I have been working a bit with polymorphisms, and as an electronics/software engineer I find it difficult to navigate the nomenclature. I have found that Wikipedia is quite a nice tool for keeping track of things, so I have started a number of articles on SNPs and I have also constructed a template; see, e.g., http://en.wikipedia.org/wiki/Rs6311
The template allows me to type in the 'rs' identifier, naming variations, and a few other identifiers. Based on this structured information in the Wikipedia template, I can automatically construct an overview of all the structured SNPs in Wikipedia.
There are now many articles for genes in Wikipedia thanks to a bot that was recently described in PLoS Biology. It is a puzzle to me why so few researchers edit Wikipedia. I have now expanded http://en.wikipedia.org/wiki/ADH1B a bit, based in part on the information in the blog. I hope it is correct. I have also redirected the old gene name: http://en.wikipedia.org/wiki/ADH2 leads to ADH1B.
I took a look at this, but I'm not sure what advantage this offers over OMIM, the Gene database, and dbSNP, besides being able to edit it. I can see from this page that I would end up consulting all of the other databases anyway in order to see where any one SNP maps relative to the rest of the gene and other SNPs. What is the advantage here? What problem does it solve?
OMIM and dbSNP provide much better authority for the basic information. Wikipedia works better as an aggregator and an easy notebook. As an example: it is not immediately apparent to me what "T102C" is. I can search across the NCBI databases and see that OMIM has an entry on "HTR2A", then go down to the "allelic variants" section and find "102T-C", where it lists two publications. NCBI Gene lists publications, but none of them seems to tell me that it is also referred to as "rs6313". Furthermore, there are many more publications than the two in OMIM. Also, the OMIM database seems not to be as up to date as the SzGene database. SzGene gives very little, if any, evidence that the SNP is associated with schizophrenia (looking at the confidence intervals), while reading the studies in OMIM, which are over ten years old, gives the impression that there is. Information like this can be added to a relevant Wikipedia article: http://en.wikipedia.org/wiki/Rs6313
It does, however, require that there is someone that looks up the information in the first place and combines the information from PubMed, Google Scholar, dbSNP, SzGene, ...
Finn: I agree, it would be very nice if publications were correlated with the SNP IDs and if OMIM were updated more frequently. I suspect that OMIM updates are infrequent because they're done by hand and not by computer.
I can see using Wikipedia as a kind of personal notebook. I don't think it would be useful for me though, since I'd have to spend too much time verifying the information and making maps.
It's not very helpful to me to think of SNPs and other variations as independent entities unless I can identify coding variations or things that affect the protein or structure. It makes more sense to me to view SNPs in the context of the rest of the gene. That's why I really like the dbSNP Gene View reports like this one for HTR2A, here.
The table lets me know whether this is a coding or non-coding variation and helps me figure out which strand is which and where the change takes place. Eventually, I imagine that OMIM links will appear in the Clinically Associated column. I would also want to know if other SNPs were in linkage disequilibrium.
Yes, I see that dbSNP Gene View gives a nice overview. Thanks for the link.
This thread is of great interest to me, as I am a writer/editor at OMIM and the ADH genes are beyond confusing. We have recently updated the OMIM entries, and it seems pretty clear that they have effects on susceptibility to alcohol. Yes, OMIM is done by hand and we are very diligent, yet it is difficult to keep up with the science. The problem with ADH nomenclature originated in the primary literature, mainly due to the lack of consensus on appropriate designations. I sincerely hope our recent update is accurate. Please let us know if you have any insight.
I did put a map together eventually and I'll take a look at the new OMIM entry, too.
With over 30K genes x lots of creatures, I understand how getting everything in synch is a challenge.