The Future of Bacterial Genomics: It's Not the Sequencing, It's the...

...assembly and analysis. The Wellcome Trust has a very good (and mostly accurate) article about the 'next-gen' sequencing technologies. I'm going to focus on bacterial genomics because humans are boring (seriously, compared to two bacteria in the same species, once you've seen one human genome, you've seen them all).

Most of the time, when you read articles about sequencing, they focus on the actual production of raw sequence data (i.e., 'reads'). But that's not the rate-limiting step anymore: we have now reached the point where working with the data is far more time-consuming than generating it.

Whole genomes don't come flying out of the sequencing machines: we have to take hundreds of thousands or millions of reads and stitch them together--what is known in genomics as assembly. It's pretty easy and fast to get a pretty good genome. By pretty good, I mean that most of the genome (~99%) is assembled into pieces 50,000 - 1,500,000 bases long*. Where assemblers get hung up with bacteria is on repeated elements--regions of the genome that are virtually identical (they don't have to be completely identical, just close enough that the assembler thinks they're identical reads with sequencing errors). Because the assembler can't figure out where to put these reads (they all look the same), it discards them--that's where the breaks occur*.
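To make the repeat problem concrete, here is a toy sketch in Python--not a real assembler, and the sequences, flanking regions, and k-mer size are all invented--showing how a repeated element creates a fork in a simple k-mer graph; those forks are where contigs end:

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of bases observed after it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[-1])
    return graph

# The same repeat sits in two different genomic contexts (different flanks).
repeat = "ATGGCTTCAG"
reads = [
    "TTGCA" + repeat,   # unique flank 1 entering the repeat
    repeat + "GGATC",   # repeat exiting into unique flank 2
    "CCTAG" + repeat,   # unique flank 3 entering the same repeat
    repeat + "AATCG",   # repeat exiting into unique flank 4
]

graph = kmer_graph(reads, k=5)

# The node at the end of the repeat has two possible next bases: the assembler
# cannot tell which flank follows the repeat, so the contig breaks there.
forks = {node: bases for node, bases in graph.items() if len(bases) > 1}
print(forks)   # e.g. {'TCAG': {'A', 'G'}}
```

Real assemblers are vastly more sophisticated, but the ambiguity itself is the same: with reads shorter than the repeat, the different flanks are interchangeable.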

This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements known as insertion sequence elements ('IS elements'; IS elements are one of the major reasons resistance genes move from plasmid to plasmid--plasmids are mini-chromosomes that can themselves move from bacterium to bacterium--and from plasmid to chromosome). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know whether it's found on a plasmid or on the chromosome--and that's a pretty critical biological question. To further complicate things, different plasmids--and the bacterial chromosome itself--can carry the same IS elements. Not only do these introduce breaks into the assembly, they can also lead to accidentally assembling different plasmids together or incorrectly incorporating plasmid sequence into the chromosome.

Now, we do have methods to close up these gaps--this process is called finishing, and it involves either targeted sequencing or manually parsing through the existing data. But these are open-ended, slow processes (particularly the targeted sequencing). Worse, this involves thinking, and, relative to computer algorithms, thinking is very slow. This is also really expensive. So we can get a pretty good assembly, but I think a lot of people, thinking back to the Sanger sequencing days, when most bacterial genomes were closed, are going to have to understand that if you want a lot of genomes, they will be 'pretty good' assemblies, not closed, finished ones.

The other area is annotation: now that you have a bunch of sequences, you would like to know what genes are found on those sequences. This involves two things: identifying the open reading frame ('ORF') of the gene (that is, which nucleotides encode proteins), and then identifying what that open reading frame encodes (I'm making this sound like a two-step process; it's actually an iterative process, where each step informs the other).
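For the first step, here is a minimal sketch of an ORF scanner in Python (forward strand only, ATG starts only, and an arbitrary length cutoff); real bacterial gene callers use statistical models of coding sequence rather than anything this naive:

```python
STOPS = {"TAA", "TAG", "TGA"}

def candidate_orfs(seq, min_codons=100):
    """Yield 0-based (start, end) spans running from ATG to a stop codon,
    scanning the three forward reading frames."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3)   # end coordinate includes the stop codon
                start = None

# Usage on an assembled contig held as a plain string:
# for start, end in candidate_orfs(contig_sequence):
#     print(start, end)
```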

Here too, we have automated gene callers, which are very fast. Actually, we have many different gene calling methods. That's good! However, they will disagree with each other about five to ten percent of the time. By disagree, I don't just mean that two different methods call the exact same region a different protein (e.g., an aldolase versus a dehydrogenase). We could cope with that for a lot of the downstream analyses we do, as long as we have identified the protein correctly**. The problem really arises when two different, overlapping regions of sequence are identified as ORFs (e.g., program A calls nucleotides 1-300 as a gene, and program B calls nucleotides 13-360 as a gene). That is not good, because then a human has to go through the output manually and figure out what the actual ORF is (requiring more thinking, which is slow and expensive). I would note that most major sequencing centers do manual annotation, but it is slow.
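Reconciling the calls is mechanical right up until the coordinates disagree. Here is a rough sketch of flagging exactly those cases, assuming the output of each (hypothetical) gene caller has been reduced to (start, end, strand) tuples:

```python
def overlaps(a, b):
    """True if two (start, end, strand) calls overlap on the same strand."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def conflicting_calls(calls_a, calls_b):
    """Return pairs of calls that overlap but do not share coordinates."""
    conflicts = []
    for a in calls_a:
        for b in calls_b:
            if overlaps(a, b) and (a[0], a[1]) != (b[0], b[1]):
                conflicts.append((a, b))
    return conflicts

# The example from above: program A calls 1-300, program B calls 13-360.
program_a = [(1, 300, "+")]
program_b = [(13, 360, "+")]
print(conflicting_calls(program_a, program_b))   # flagged for manual review
```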

So, from a bacterial perspective, genome sequencing is really cheap and fast--in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won't be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.

Interesting times.

*There are other technical reasons why breaks occur, but, to me, this is the worst offender.


In my professional experience, the issue with annotation is getting the start codons right--it's so painful... If you look at the annotation for a particular organism at the original sequencing center, the annotation for it at NCBI, and the annotation generated for it at my place of employment, you will see 3 different start codons.

One of the three even goes so far as to routinely use TTG as the start codon in a bacterium with 68% GC (I'm looking at you, NCBI).

This is why I keep trying to get people here interested in using RNA-seq not just for gene expression but to produce biologically validated annotation information. And by some miracle, we have a Solexa run scheduled in 2 weeks that will include 4 bacterial samples, assuming I can get them prepped.

Of course, we don't have a software pipeline to derive the annotation information from the output, and I'm a microbiologist and not a computer guy, so yeah, I'm open to suggestions and/or collaborations :p
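One small piece of such a pipeline could be as simple as asking, for each annotated gene, whether RNA-seq coverage just upstream of the annotated start looks like coding-level coverage--if it does, the start codon may sit too far downstream. A rough Python sketch, assuming per-base coverage (e.g., from samtools depth) and a list of annotated starts are already in hand, with arbitrary window and ratio cutoffs:

```python
def suspicious_starts(coverage, gene_starts, window=30, ratio=0.8):
    """Flag annotated starts whose upstream coverage resembles coding coverage.

    coverage    -- list of per-base read depths along a contig (0-based)
    gene_starts -- 0-based start coordinates of genes on the forward strand
    """
    flagged = []
    for start in gene_starts:
        if start < window:
            continue
        upstream = coverage[start - window:start]
        inside = coverage[start:start + window]
        if not inside:
            continue
        mean_up = sum(upstream) / len(upstream)
        mean_in = sum(inside) / len(inside)
        if mean_in > 0 and mean_up / mean_in >= ratio:
            flagged.append(start)
    return flagged
```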

Interesting post! I was at an annotation workshop earlier this summer and heard presentations on this subject by people from JGI, CSHL, and some other places. One of the tools they talked about (I forget exactly who brought it up), and that we worked with too, is optical mapping. Optical mapping works by fixing a DNA molecule to a slide and cutting it with a restriction enzyme; the DNA is then stained so that the distances between cut sites can be measured. It's a great tool for measuring the physical size of a genome and comparing it to the predicted assembly, so you have a reality check on your assembly program.
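In practice, that reality check can start with an in silico digest: cut the assembled sequence at the optical map's enzyme recognition site and compare the predicted fragment sizes to the measured ones. A rough Python sketch, where the recognition site and the input variables are placeholders:

```python
def in_silico_digest(sequence, site="GAATTC"):
    """Return predicted fragment lengths from cutting at every occurrence of `site`."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [boundaries[i + 1] - boundaries[i] for i in range(len(boundaries) - 1)]

# predicted = sorted(in_silico_digest(assembled_contig), reverse=True)
# measured  = sorted(optical_map_fragment_sizes, reverse=True)
# Large disagreements between the two lists point to misassemblies or missing sequence.
```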

Great post, Mike, and thanks for the kind words about the Wellcome article. I had wanted to go into the difficulties of finishing (as well as data storage and analysis), but word limits restricted how much I was able to discuss. Very glad you've written this.

Just a question: how much of a genome has to be assembled before it can be published?

Mike,

Nice article. Tempts me to finish those assembler and annotation tools I was working on! (It's a project that got pushed into the background--you know how it goes.)

noyk,

In the sense of being able to "publish" the data by placing it in the databases, any amount. In the sense of publishing it in the research literature, that depends on the "story" behind your sequencing effort (what it is that you are trying to find/show).

By BioinfoTools on 08 Aug 2009