On the Origin of New Exons

Nobel Intent has an excellent summary of a paper in the PNAS pipeline on the origin of new exons in the human genome. The authors compared genes between humans and seven other vertebrates to identify newly arisen exons. They found that many new exons are composed of repeat sequences, such as transposable elements. Also, recently evolved exons are more likely to be alternatively spliced, suggesting there is a "trial period" for a new exon before it can be fully incorporated into the protein coding sequence of a gene.

This is an interesting paper for a number of reasons. One of the things that fascinate me is the reliability of alternative splicing predictions. These are almost entirely based on EST data and that data is known to be flawed in too many ways to list here.

When you look closely at genes that have been intensively studied, the alternative splicing predictions just don't make sense. This is especially true when the structure of the protein has been solved. That's why the most recent annotations of the genome ignore the EST data.

Here's an example from my favorite genes: the HSP70 gene family. The ECgene database for human BiP (HSPA5) lists 14 splice variants (H9C10987). Many of them result in deletions and insertions of amino acid stretches in the hydrophobic core of the protein. This is one of the most highly conserved proteins in all of biology (the 650 amino acid residues of most mammals are almost identical). Does it make any sense that Homo sapiens would evolve new exons for inserting amino acids into the middle of the protein when no other species has them?

Of course it doesn't. That's why when you go to the EntrezGene entry for this gene (3309) you will see that none of the so-called alternative splice variants has been accepted by the annotators in the latest release. It's very important that everyone understand what this decision means. It means that intelligent people (annotators) have correctly rejected all of the alternative splicing data for this gene. This so-called "data" is no different than the data for all other genes with so-called "alternative splice variants."

You see this same pattern in many well-studied genes. It leads to the conclusion that the alternative splicing databases are inaccurate for those genes that we know the most about. It strongly suggests that the entire database is flawed. The EST data is almost useless in predicting exons.

The Zhang & Chasin paper relies heavily on those databases to predict exons that have only "appeared" recently. If most of those exons were artifacts arising from the flawed EST data, then we would expect the following:

1. They would be "included" only in rare ESTs. That's what the authors find.

2. They would contain a high percentage of highly repetitive DNA resembling most of the junk DNA in the genome. That's what the authors find.

3. The nucleotide sequence of the predicted exons would resemble that of non-coding DNA and differ considerably from the sequences of the surrounding true exons. This is exactly what the authors find.
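The three expectations above can be written down as a simple check. This is only a rough sketch in Python, not anything taken from the paper: the `CandidateExon` fields, the function names, and all three thresholds are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateExon:
    """A predicted exon with three summary measurements (all hypothetical)."""
    name: str
    est_inclusion: float          # fraction of ESTs covering the locus that include the exon
    repeat_fraction: float        # fraction of the exon's bases annotated as repetitive DNA
    coding_score: float           # any coding-likeness score; higher means more exon-like
    neighbor_coding_score: float  # the same score averaged over the flanking accepted exons


def artifact_signatures(exon, max_inclusion=0.1, min_repeat=0.5, min_score_gap=0.2):
    """Report which of the three expected signatures the exon shows.

    The thresholds are arbitrary illustrative values, not published cutoffs.
    """
    return {
        # 1. included only in rare ESTs
        "rare_in_ests": exon.est_inclusion <= max_inclusion,
        # 2. made up largely of repetitive DNA
        "repeat_rich": exon.repeat_fraction >= min_repeat,
        # 3. looks less like coding sequence than the surrounding exons
        "noncoding_like": (exon.neighbor_coding_score - exon.coding_score) >= min_score_gap,
    }


def matches_all_signatures(exon, **thresholds):
    """True when the exon shows all three signatures."""
    return all(artifact_signatures(exon, **thresholds).values())


# Made-up example values, for illustration only.
suspect = CandidateExon("candidate_A", 0.03, 0.8, 0.3, 0.9)
constitutive = CandidateExon("candidate_B", 0.95, 0.0, 0.85, 0.9)

print(matches_all_signatures(suspect))       # all three signatures present
print(matches_all_signatures(constitutive))  # none of the signatures present
```

Matching all three signatures does not prove that a predicted exon is an artifact. It only shows that the exon looks the way the artifact hypothesis says it should.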

The authors do not question the validity of the EST databases in predicting alternative splicing and new exons but there are some papers that do. It's time we started to pay attention. If the databases are wrong then papers like this one are completely useless. They don't tell us a damn thing about the evolution of exons because those new exons are artifacts.

By Larry Moran (not verified) on 08 Sep 2006

Great points. A friend of mine has been searching the databases to identify examples of TEs inserted into protein coding genes. He won't accept a gene into his data set unless the protein (not DNA or RNA) has been sequenced and shown to contain the TE cassette. Judging by the literature in this field, you'd think he'd be able to find tons and tons of these examples.

To date, I don't think he has identified a single one (or maybe one or two). Granted, protein sequence databases are far more sparse than genomic or EST sequence databases. But it appears that many (or most, or possibly all) examples of TEs inserted into protein coding genes are false positives.