Some Thoughts About STAP

Jonathan Eisen has a paper in PLoS One describing software that he's developed for analyzing 16S rRNA sequence data. Rather than walk through everything, I've decided this post will be different: I'm going to treat this as a manuscript that I'm reviewing (there will be some differences, and it won't be as formally written as a 'real' review). But I wanted to phrase some 'real' questions, as opposed to extensively distilling it for the 'lay' reader so non-scientists could see what we really criticize each other about (hint: it's not whether evolution is real). Onto the review.

Eisen has a good summary of what the program does:

...it describes automated software for analyzing rRNA sequences that are generated as part of microbial diversity studies. The main goal behind this was to keep up with the massive amounts of rRNA sequences we and others could generate in the lab and to develop a tool that would remove the need for "manual" work in analyzing rRNAs....

The basics of the software are summarized below: (see flow chart too).


  • Stage 1: Domain Analysis
    • Take a rRNA sequence
    • blast it against a database of representative rRNAs from all lines of life
    • use the blast results to help choose sequences to use to make a multiple sequence alignment
    • infer a phylogenetic tree from the alignment
    • assign the sequence to a domain of life (bacteria, archaea, eukaryotes)
  • Stage 2: First pass alignment and tree within domain
    • take the same rRNA sequence
    • blast against a database of rRNAs from within the domain of interest
    • use the blast results to help choose sequences for a multiple alignment
    • infer a phylogenetic tree from the alignment
    • assign the sequence to a taxonomic group
  • Stage 3: Second pass alignment and tree within domain
    • extract sequences from members of the putative taxonomic group (as well as some others to balance the diversity)
    • make a multiple sequence alignment
    • infer a phylogenetic tree

From the above path, we end up with an alignment, which is useful for things such as counting number of species in a sample as well as a tree which is useful for determining what types of organisms are in the sample.


I note - the key is that it is completely automated and can be run on a single machine or a cluster and produces comparable results to manual methods. In the long run we plan to connect this to other software and other labs develop to build a metagenomics and microbial diversity workflow that will help in the processing of massive amounts of sequence data for microbial diversity studies.

Before I get started, some disclosure: I am overseeing the development of this type of analytical pipeline for a major genome center. We are using some of the methods described here, some which aren't, and some we built ourselves (or heavily modified).

First, there are some good things:

  1. The software can be implemented across a cluster, making it very fast even though it's computationally intensive.
  2. It's phylogenetically based, and, at least in principle, should be more accurate than methods that look for sequence identity (sequence identity can evolve by convergent evolution, particular in 16S rRNA where the secondary structure of the rRNA--as opposed to the sequence itself--is also very important).
  3. The program will provide a phylogeny as output (which can be used in other programs).

Now the questions and issues:

  1. There are a lot of comparisons between STAP and BLASTN in terms of performance, but none with RDP. How does the performance of STAP compare to that of RDP?
  2. STAP appears to have the most utility when using long 16S reads. With the advent of 454 technology, where reads will be between 200-400 bp, the conserved regions will be very short, and consequently have little phylogenetic signal. How well does the intermediate classification step work with short reads?
  3. How robust is the classification to alignment methods (i.e., NAST, MUSCLE, KOFFEE, INFERNAL)? This issue could potentially become even worse with shorter reads. (k-mer based methods are 'alignment-immune').
  4. A minor quibble but the RDP aligner is open source, and the clustering algorithm is simply DOTUR.
  5. It's possible to retrain the RDP classifier. How easy is it to retrain STAP if one wants to use a different classification scheme?

I realize that most readers, other than Eisen (if he happens to stumble across this), will have no idea what I'm talking about. Don't worry, I'll return soon to calling idiots who desperately need my help fucking morons. But this post does have two uses. First, I do want to know the answers (consider this very-high order blogwhorring). Second, it just highlights how ridiculous creationists are, whether they be young-earth or intelligent design. As I've discussed previously, they're arguing about horse-drawn buggies while we're roaring past them in supersonic jets. They can't even understand basic biological concepts, never mind the stuff most biologists do on a daily basis*. They are partying like it's 1859.

*If any of the creationist Uruk-hai do stumble across this post, you know they'll argue that this is genetics and not TEH DARWINISMZ!!, which just shows how stupid they are.

More like this

Lots of things to comment on here

1. We compared it to blastn because that is the most commonly used method for people to do things in their own computational pipeline.

2. We did do some comparisons to various web tools like those available through RDP, GreenGenes, etc. Not all of these are in the paper but I will try to post on them here or on the PloS One site. The main goal was to make stuff that did not need to do thiings on web servers so really we wanted to compare to things you could install locally.

3. We have not used STAP for short read sequences. I think Mitch Sogin's approach to this is better than STAP. He basically compares the short reads to full lenngth sequences and then uses the information about the full length sequence to say what the short read might be.

4. As for the robustness to the alignment method --- this is an issue. My hope is that we can eventually use programs being developed by Sean Eddy to do the alignment and skip all of the other programs.

5. Not sure what you mean by point #4 minor quibble about DOTUR and RDP open source.

5. Easy to retrain STAP - you just use a different database or taxonomy list for the pre-aligned sequences

In the middle of a few things. Will post more later

Mike,

We will be posting some analysis data soon from using Maq to do alignments with NextGen data on our company blog - FinchTalk. I'll let you know when it's posted - that might be helpful when thinking about the best way to align short reads. A lot of people are using Maq for resequencing and SNP analysis.

One of the problems that I see with Eisen's paper is the lack of discussion concerning sequence quality. A great advancement in the human genome community, as you probably know, came about because phred could evaluate sequence quality and phrap was able to use sequence quality in assembling sequences.

Perhaps these analyses are happening upstream, but I don't see anything in the Eisen paper that discusses assemblies or quality scores. It seems that those steps would be an important part, early on in the pipeline.

Why do I think quality analyses are important? Because I have been looking at lots of traces from amplified 16S rRNA and I can find many cases (with either phred or KB) where bases have been miscalled - or there are polymorphisms because of the multiple copies of rRNA in the genome. I have some examples here - my examples are from plant genes - but I see the same kinds of things with bacterial rRNA sequences - plus lots of polymorphisms. Assembly tools, like phred, cross_match and phrap (even though they are not open source), that are aware of mixed bases, and can perform trimming and vector masking, would reduce early problems with miscalls and would be a useful addition to any kind of automated pipeline.

And of course, with NextGen sequencing, the whole workflow changes anyway.

Infernal's almost ready to be slotted into rRNA analysis pipelines. 'Til recently it had two main problems -- too damned slow, and not very good at dealing with local alignment to small (454-sized) reads. Eric Nawrocki has pretty much addressed the first problem, and Eric and Diana Kolbe have developed good local alignment approaches for the second problem (which is more interesting that it may seem at first, because you're doing a local alignment to a *structure*, not a sequence -- but the 454 read may have thrown away parts of the sequence that you needed to see the base pairs). Infernal 1.0 is close to release -- we're in testing now. Release candidate rc1 is out at infernal.janelia.org, and rc2 (fixing some bugs) will be out in the next couple of days.

Eric's now working with a variety of rRNA folks to bring it up to production strength specifically for rRNA alignment and subsequent phylogenetic analysis.

Interestingly, we're finding that profile HMMs (sequence consensus only) are nearly as good as Infernal's profile SCFGs (seq + structure) for most rRNA alignment tasks; both outdo BLASTN alignment by a fair margin. It might ultimately make sense to use profile HMMs for all but the most phylogenetically weird rRNAs (where you need to peer at secondary structure to get the alignment right). Any kind of profile-based method ought to be a leap forward w/ respect to nearest neighbor pairwise alignment.

I for one welcome 'inside' discussions among our sequence analysis overlords. I think I actually generally understand everything you are talking, about despite not knowing the specific methods all those strings of capital letters represent.

Anyways, I wish you all the best of luck, because I really don't want to have to think too deeply about these algorithms ;) I'll shut up now and leave you to it.