The Future of Bacterial Genomics: It's Not the Sequencing, It's the...

...assembly and analysis. The Wellcome Trust has a very good (and mostly accurate) article about the 'next-gen' sequencing technologies. I'm going to focus on bacterial genomics because humans are boring (seriously, compared to two bacteria in the same species, once you've seen one human genome, you've seen them all).

Most of the time, when you read articles about sequencing, they focus on the actual production of raw sequence data (i.e., 'reads'). But that's not the rate-limiting step anymore: we have now reached the point where working with the data is far more time-consuming than generating it.

Whole genomes don't come flying out of the sequencing machines: we have to take hundreds of thousands or millions of reads and stitch them together--what is known in genomics as assembly. It's pretty easy and fast to get a pretty good genome. By pretty good, I mean that most of the genome (~99%) is assembled into pieces 50,000 - 1,500,000 bases long*. Where assemblers get hung up with bacteria is on repeated elements--regions of the genome that are virtually identical (they don't have to be completely identical, just close enough that the assembler thinks they're identical reads with sequencing errors). Because the assembler can't figure out where to put these reads (they all look the same), it discards them--that's where the breaks occur*.
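To make the repeat problem concrete, here is a toy sketch in Python--not a real assembler, and the sequences, flanking regions, and k-mer size are all invented--showing how a repeated element creates a fork in a simple k-mer graph; those forks are where contigs end:

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of bases observed after it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[-1])
    return graph

# The same repeat sits in two different genomic contexts (different flanks).
repeat = "ATGGCTTCAG"
reads = [
    "TTGCA" + repeat,   # unique flank 1 entering the repeat
    repeat + "GGATC",   # repeat exiting into unique flank 2
    "CCTAG" + repeat,   # unique flank 3 entering the same repeat
    repeat + "AATCG",   # repeat exiting into unique flank 4
]

graph = kmer_graph(reads, k=5)

# The node at the end of the repeat has two possible next bases: the assembler
# cannot tell which flank follows the repeat, so the contig breaks there.
forks = {node: bases for node, bases in graph.items() if len(bases) > 1}
print(forks)   # e.g. {'TCAG': {'A', 'G'}}
```

Real assemblers are vastly more sophisticated, but the ambiguity itself is the same: with reads shorter than the repeat, the different flanks are interchangeable.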

This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements known as insertion sequence elements ('IS elements'; IS elements are one of the major reasons resistance genes move from plasmid to plasmid--plasmids are mini-chromosomes that can themselves move from bacterium to bacterium--and from plasmid to chromosome). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know whether it's found on a plasmid or on the chromosome--and that's a pretty critical biological question. To further complicate things, different plasmids--and the bacterial chromosome itself--can carry the same IS elements. Not only do these introduce breaks into the assembly, they can also lead to accidentally assembling different plasmids together or incorrectly incorporating plasmid sequence into the chromosome.

Now, we do have methods to close up these gaps--this process is called finishing, and it involves either targeted sequencing or manually parsing through the existing data. But these are open-ended, slow processes (particularly the targeted sequencing). Worse, this involves thinking, and, relative to computer algorithms, thinking is very slow. This is also really expensive. So we can get a pretty good assembly, but I think a lot of people, thinking back to the Sanger sequencing days, when most bacterial genomes were closed, are going to have to understand that if you want a lot of genomes, they will be 'pretty good' assemblies, not closed, finished ones.

The other area is annotation: now that you have a bunch of sequences, you would like to know what genes are found on those sequences. This involves two things: identifying the open reading frame ('ORF') of the gene (that is, which nucleotides encode proteins), and then identifying what that open reading frame encodes (I'm making this sound like a two-step process; it's actually an iterative process, where each step informs the other).
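For the first step, here is a minimal sketch of an ORF scanner in Python (forward strand only, ATG starts only, and an arbitrary length cutoff); real bacterial gene callers use statistical models of coding sequence rather than anything this naive:

```python
STOPS = {"TAA", "TAG", "TGA"}

def candidate_orfs(seq, min_codons=100):
    """Yield 0-based (start, end) spans running from ATG to a stop codon,
    scanning the three forward reading frames."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3)   # end coordinate includes the stop codon
                start = None

# Usage on an assembled contig held as a plain string:
# for start, end in candidate_orfs(contig_sequence):
#     print(start, end)
```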

Here too, we have automated gene callers, which are very fast. Actually, we have many different gene calling methods. That's good! However, they will disagree with each other about five to ten percent of the time. By disagree, I don't just mean that two different methods call the exact same region a different protein (e.g., an aldolase versus a dehydrogenase). We could cope with that for a lot of the downstream analyses we do, as long as we have identified the protein correctly**. The problem really arises when two different, overlapping regions of sequence are identified as ORFs (e.g., program A calls nucleotides 1-300 as a gene, and program B calls nucleotides 13-360 as a gene). That is not good, because then a human has to go through the output manually and figure out what the actual ORF is (requiring more thinking, which is slow and expensive). I would note that most major sequencing centers do manual annotation, but it is slow.
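Reconciling the calls is mechanical right up until the coordinates disagree. Here is a rough sketch of flagging exactly those cases, assuming the output of each (hypothetical) gene caller has been reduced to (start, end, strand) tuples:

```python
def overlaps(a, b):
    """True if two (start, end, strand) calls overlap on the same strand."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def conflicting_calls(calls_a, calls_b):
    """Return pairs of calls that overlap but do not share coordinates."""
    conflicts = []
    for a in calls_a:
        for b in calls_b:
            if overlaps(a, b) and (a[0], a[1]) != (b[0], b[1]):
                conflicts.append((a, b))
    return conflicts

# The example from above: program A calls 1-300, program B calls 13-360.
program_a = [(1, 300, "+")]
program_b = [(13, 360, "+")]
print(conflicting_calls(program_a, program_b))   # flagged for manual review
```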

So, from a bacterial perspective, genome sequencing is really cheap and fast--in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won't be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.

Interesting times.

*There are other technical reasons why breaks occur, but, to me, this is the worst offender.


In my professional experience, the issue with annotation is getting the start codons right--it's so painful... If you look at the annotation for a particular organism at the original sequencing center, the annotation for it at NCBI, and the annotation generated for it at my place of employment, you will see 3 different start codons.

One of the three even goes so far as to routinely use TTG as the start codon in a bacterium with 68% GC (I'm looking at you, NCBI).

This is why I keep trying to get people here interested in using RNA-seq not just for gene expression but to produce biologically validated annotation information. And by some miracle, we have a Solexa run scheduled in 2 weeks that will include 4 bacterial samples, assuming I can get them prepped.

Of course, we don't have a software pipeline to derive the annotation information from the output, and I'm a microbiologist and not a computer guy, so yeah, I'm open to suggestions and/or collaborations :p
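One small piece of such a pipeline could be as simple as asking, for each annotated gene, whether RNA-seq coverage just upstream of the annotated start looks like coding-level coverage--if it does, the start codon may sit too far downstream. A rough Python sketch, assuming per-base coverage (e.g., from samtools depth) and a list of annotated starts are already in hand, with arbitrary window and ratio cutoffs:

```python
def suspicious_starts(coverage, gene_starts, window=30, ratio=0.8):
    """Flag annotated starts whose upstream coverage resembles coding coverage.

    coverage    -- list of per-base read depths along a contig (0-based)
    gene_starts -- 0-based start coordinates of genes on the forward strand
    """
    flagged = []
    for start in gene_starts:
        if start < window:
            continue
        upstream = coverage[start - window:start]
        inside = coverage[start:start + window]
        if not inside:
            continue
        mean_up = sum(upstream) / len(upstream)
        mean_in = sum(inside) / len(inside)
        if mean_in > 0 and mean_up / mean_in >= ratio:
            flagged.append(start)
    return flagged
```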

Interesting post! I was at an annotation workshop earlier this summer and heard presentations on this subject by people from JGI, CSHL, and some other places. One of the tools they talked about (I forget exactly who brought it up), and that we worked with too, is optical mapping. Optical mapping works by fixing a DNA molecule to a slide and cutting it with a restriction enzyme; the DNA is then stained so that the distances between cut sites can be measured. It's a great tool for measuring the physical size of a genome and comparing it to the predicted assembly, so you have a reality check on your assembly program.
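In practice, that reality check can start with an in silico digest: cut the assembled sequence at the optical map's enzyme recognition site and compare the predicted fragment sizes to the measured ones. A rough Python sketch, where the recognition site and the input variables are placeholders:

```python
def in_silico_digest(sequence, site="GAATTC"):
    """Return predicted fragment lengths from cutting at every occurrence of `site`."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [boundaries[i + 1] - boundaries[i] for i in range(len(boundaries) - 1)]

# predicted = sorted(in_silico_digest(assembled_contig), reverse=True)
# measured  = sorted(optical_map_fragment_sizes, reverse=True)
# Large disagreements between the two lists point to misassemblies or missing sequence.
```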

Great post, Mike, and thanks for the kind words about the Wellcome article. I had wanted to go into the difficulties of finishing (as well as data storage and analysis), but word limits restricted how much I was able to discuss. Very glad you've written this.

Just a question: how much of a genome has to be assembled before it can be published?

Mike,

Nice article. Tempts me to finish those assembler and annotation tools I was working on! (It's a project that got pushed into the background--you know how it goes.)

noyk,

In the sense of being able to "publish" the data by placing it in the databases, any amount. In the sense of publishing it in the research literature, that depends on the "story" behind your sequencing effort (what it is that you are trying to find/show).

By BioinfoTools on 08 Aug 2009