Some Thoughts About STAP

By mikethemadbiologist on July 10, 2008.

Jonathan Eisen has a paper in PLoS One describing software that he's developed for analyzing 16S rRNA sequence data. Rather than walk through everything, I've decided this post will be different: I'm going to treat this as a manuscript that I'm reviewing (there will be some differences, and it won't be as formally written as a 'real' review). But I wanted to phrase some 'real' questions, as opposed to extensively distilling it for the 'lay' reader so non-scientists could see what we really criticize each other about (hint: it's not whether evolution is real). Onto the review.

Eisen has a good summary of what the program does:

...it describes automated software for analyzing rRNA sequences that are generated as part of microbial diversity studies. The main goal behind this was to keep up with the massive amounts of rRNA sequences we and others could generate in the lab and to develop a tool that would remove the need for "manual" work in analyzing rRNAs....
The basics of the software are summarized below: (see flow chart too).

Stage 1: Domain Analysis

Take a rRNA sequence

blast it against a database of representative rRNAs from all lines of life

use the blast results to help choose sequences to use to make a multiple sequence alignment

infer a phylogenetic tree from the alignment

assign the sequence to a domain of life (bacteria, archaea, eukaryotes)

Stage 2: First pass alignment and tree within domain

take the same rRNA sequence

blast against a database of rRNAs from within the domain of interest

use the blast results to help choose sequences for a multiple alignment

infer a phylogenetic tree from the alignment

assign the sequence to a taxonomic group

Stage 3: Second pass alignment and tree within domain

extract sequences from members of the putative taxonomic group (as well as some others to balance the diversity)

make a multiple sequence alignment

infer a phylogenetic tree

From the above path, we end up with an alignment, which is useful for things such as counting number of species in a sample as well as a tree which is useful for determining what types of organisms are in the sample.

I note - the key is that it is completely automated and can be run on a single machine or a cluster and produces comparable results to manual methods. In the long run we plan to connect this to other software and other labs develop to build a metagenomics and microbial diversity workflow that will help in the processing of massive amounts of sequence data for microbial diversity studies.

Before I get started, some disclosure: I am overseeing the development of this type of analytical pipeline for a major genome center. We are using some of the methods described here, some which aren't, and some we built ourselves (or heavily modified).

First, there are some good things:

The software can be implemented across a cluster, making it very fast even though it's computationally intensive.
It's phylogenetically based, and, at least in principle, should be more accurate than methods that look for sequence identity (sequence identity can evolve by convergent evolution, particular in 16S rRNA where the secondary structure of the rRNA--as opposed to the sequence itself--is also very important).
The program will provide a phylogeny as output (which can be used in other programs).

Now the questions and issues:

There are a lot of comparisons between STAP and BLASTN in terms of performance, but none with RDP. How does the performance of STAP compare to that of RDP?
STAP appears to have the most utility when using long 16S reads. With the advent of 454 technology, where reads will be between 200-400 bp, the conserved regions will be very short, and consequently have little phylogenetic signal. How well does the intermediate classification step work with short reads?
How robust is the classification to alignment methods (i.e., NAST, MUSCLE, KOFFEE, INFERNAL)? This issue could potentially become even worse with shorter reads. (k-mer based methods are 'alignment-immune').
A minor quibble but the RDP aligner is open source, and the clustering algorithm is simply DOTUR.
It's possible to retrain the RDP classifier. How easy is it to retrain STAP if one wants to use a different classification scheme?

I realize that most readers, other than Eisen (if he happens to stumble across this), will have no idea what I'm talking about. Don't worry, I'll return soon to calling idiots who desperately need my help fucking morons. But this post does have two uses. First, I do want to know the answers (consider this very-high order blogwhorring). Second, it just highlights how ridiculous creationists are, whether they be young-earth or intelligent design. As I've discussed previously, they're arguing about horse-drawn buggies while we're roaring past them in supersonic jets. They can't even understand basic biological concepts, never mind the stuff most biologists do on a daily basis*. They are partying like it's 1859.

*If any of the creationist Uruk-hai do stumble across this post, you know they'll argue that this is genetics and not TEH DARWINISMZ!!, which just shows how stupid they are.

More like this

BLASTing through the flu: activity 5, how similar is similar?

No more delays! BLAST away! Time to blast. Let's see what it means for sequences to be similar. First, we'll plan our experiment. When I think about digital biology experiments, I organize the steps in the following way:

Shotgun Sequencing a Eukaryotic Genome

Shotgun sequencing refers to the process whereby a genome is sequenced and assembled with no prior information regarding the genomic location of any of the DNA we sequence. There are quite a few steps that you have to go through before you have an assembled genome sequence.

Development and Role of the Human Reference Sequence in Personal Genomics

A few weeks back, we published a review about the development and role of the human reference genome. A key point of the reference genome is that it is not a single sequence.

More flu follies: comparing sequences and making trees, activity 4

What tells us that this new form of H1N1 is swine flu and not regular old human flu or avian flu? If we had a lab, we might use antibodies, but when you're a digital biologist, you use a computer.

Lots of things to comment on here

1. We compared it to blastn because that is the most commonly used method for people to do things in their own computational pipeline.

2. We did do some comparisons to various web tools like those available through RDP, GreenGenes, etc. Not all of these are in the paper but I will try to post on them here or on the PloS One site. The main goal was to make stuff that did not need to do thiings on web servers so really we wanted to compare to things you could install locally.

3. We have not used STAP for short read sequences. I think Mitch Sogin's approach to this is better than STAP. He basically compares the short reads to full lenngth sequences and then uses the information about the full length sequence to say what the short read might be.

4. As for the robustness to the alignment method --- this is an issue. My hope is that we can eventually use programs being developed by Sean Eddy to do the alignment and skip all of the other programs.

5. Not sure what you mean by point #4 minor quibble about DOTUR and RDP open source.

5. Easy to retrain STAP - you just use a different database or taxonomy list for the pre-aligned sequences

In the middle of a few things. Will post more later

Mike,

We will be posting some analysis data soon from using Maq to do alignments with NextGen data on our company blog - FinchTalk. I'll let you know when it's posted - that might be helpful when thinking about the best way to align short reads. A lot of people are using Maq for resequencing and SNP analysis.

One of the problems that I see with Eisen's paper is the lack of discussion concerning sequence quality. A great advancement in the human genome community, as you probably know, came about because phred could evaluate sequence quality and phrap was able to use sequence quality in assembling sequences.

Perhaps these analyses are happening upstream, but I don't see anything in the Eisen paper that discusses assemblies or quality scores. It seems that those steps would be an important part, early on in the pipeline.

Why do I think quality analyses are important? Because I have been looking at lots of traces from amplified 16S rRNA and I can find many cases (with either phred or KB) where bases have been miscalled - or there are polymorphisms because of the multiple copies of rRNA in the genome. I have some examples here - my examples are from plant genes - but I see the same kinds of things with bacterial rRNA sequences - plus lots of polymorphisms. Assembly tools, like phred, cross_match and phrap (even though they are not open source), that are aware of mixed bases, and can perform trimming and vector masking, would reduce early problems with miscalls and would be a useful addition to any kind of automated pipeline.

And of course, with NextGen sequencing, the whole workflow changes anyway.

Wow, less than an hour to have the author respond, the interweb being put to good use.

Infernal's almost ready to be slotted into rRNA analysis pipelines. 'Til recently it had two main problems -- too damned slow, and not very good at dealing with local alignment to small (454-sized) reads. Eric Nawrocki has pretty much addressed the first problem, and Eric and Diana Kolbe have developed good local alignment approaches for the second problem (which is more interesting that it may seem at first, because you're doing a local alignment to a *structure*, not a sequence -- but the 454 read may have thrown away parts of the sequence that you needed to see the base pairs). Infernal 1.0 is close to release -- we're in testing now. Release candidate rc1 is out at infernal.janelia.org, and rc2 (fixing some bugs) will be out in the next couple of days.

Eric's now working with a variety of rRNA folks to bring it up to production strength specifically for rRNA alignment and subsequent phylogenetic analysis.

Interestingly, we're finding that profile HMMs (sequence consensus only) are nearly as good as Infernal's profile SCFGs (seq + structure) for most rRNA alignment tasks; both outdo BLASTN alignment by a fair margin. It might ultimately make sense to use profile HMMs for all but the most phylogenetically weird rRNAs (where you need to peer at secondary structure to get the alignment right). Any kind of profile-based method ought to be a leap forward w/ respect to nearest neighbor pairwise alignment.

I for one welcome 'inside' discussions among our sequence analysis overlords. I think I actually generally understand everything you are talking, about despite not knowing the specific methods all those strings of capital letters represent.

Anyways, I wish you all the best of luck, because I really don't want to have to think too deeply about these algorithms ;) I'll shut up now and leave you to it.

thanks write...

Advertisment

Donate

ScienceBlogs is where scientists communicate directly with the public. We are part of Science 2.0, a science education nonprofit operating under Section 501(c)(3) of the Internal Revenue Code. Please make a tax-deductible donation if you value independent science communication, collaboration, participation, and open access.

You can also shop using Amazon Smile and though you pay nothing more we get a tiny something.

Science 2.0

Science Codex

More by this author

Program Announcement: I'm Moving

September 1, 2011

I've dropped some hints in the past that my relationship with ScienceBlogs would be...altered. Well, I've decided to leave. Mostly, it had to do with the issue of pseudonymity, although I'm very excited to hang out my own shingle once again. I don't want to rehash the issue of pseudonymity,…

Note to Unions: This Is Not How You Build a Coalition

September 1, 2011

The old saw that 'we hang together or we get hung separately' is a perfect description of how the left has disintegrated into irrelevance. Too often, groups will focus on modest gains for their own narrow constituency, while selling out other allies. Over the long term, each component of the…

Links 8/31/11

August 31, 2011

Links for you. Science: Underground river 'Rio Hamza' discovered 4km beneath the Amazon What do accommodationists do about creationist politicians? I've Been Told You Can Get Flu From the Flu Shot: False! Federal Work Suspension of Leading Arctic Scientist Ended as Investigation of His…

Meet the New New Math, Same As the Old New Math? What We Can Learn from Finland

August 31, 2011

Recently, The New York Times published an op-ed calling for curricular changes in K-12 math education: Today, American high schools offer a sequence of algebra, geometry, more algebra, pre-calculus and calculus (or a "reform" version in which these topics are interwoven). This has been codified by…

Links 8/30/11

August 30, 2011

Links for you. Another Scientist Calls Out Sen. Coburn's Misleading, Juvenile "Report" XMRV: ITS EVERYWHERE! UUUUUGH! ITS IN MY RACCOON WOUNDS! AND MY QIAGEN COLUMNS! Coulter Goes All Science-y in Bid to Disprove Evolution Yet another bad day for the anti-vaccine movement 2011 Antibiotics: Killing…