Over at The Tree of Life, Jonathan Eisen asks:
What do people think are the potential benefits that could come from finishing?
For those who don't know what genome finishing is, I'll let Eisen give the short summary:
Finishing: Using any combination of laboratory, computational and other analyses one can both fill in gaps in the assembly and improve the quality of the assembly. This can generally be called "finishing"
In the context of microbial genomes, here are some of my thoughts about finishing (italics original; boldface mine):
Whole genomes don't come flying out of the sequencing machines: we have to take hundreds of thousands or millions of reads and stitch them together--what is known in genomics as assembly. It's pretty easy and fast to get a pretty good genome. By pretty good, I mean that most of the genome (~99%) is assembled into pieces 50,000 - 1,500,000 bases long.... Where the assemblers get hung up with bacteria is repeated elements--regions of the genome that are virtually identical (they don't have to be completely identical, just close enough that the assembler thinks they're identical reads with sequencing errors). Because the assembler can't figure out where to put these reads (they're all identical), it discards them--that's where the breaks occur...
This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements, known as insertion sequence elements ('IS elements'; IS elements are one of the major reasons resistance genes move from plasmid to plasmid--plasmids are mini-chromosomes that themselves can move from bacterium to bacterium--and from plasmid to chromosome). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know if it's found on a plasmid or on the chromosome--that's a pretty critical biological question. To further complicate things, different plasmids can have the same IS elements, along with the bacterial chromosome. Not only will these introduce breaks into the assembly, but they can also lead to accidentally assembling plasmids together or incorrectly incorporating them into the genome.
Now, we do have methods to close up these gaps--this process is called finishing, and it involves either targeted sequencing or manually parsing through the existing data. But these are open-ended, slow processes (particularly the targeted sequencing). Worse, this involves thinking, and, relative to computer algorithms, thinking is very slow. This is also really expensive. So we can get a pretty good assembly, but I think a lot of people, thinking back to the Sanger sequencing days, when most bacterial genomes were closed, are going to have to understand that if you want a lot of genomes, they will be 'pretty good' assemblies, not closed, finished ones.
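To make the repeat problem in the passage above concrete, here is a toy sketch in Python. It is not how any real assembler works: the function name `unique_contigs`, the made-up sequences, and the choice of k = 4 are all assumptions for illustration. It treats every k-mer that occurs more than once as unplaceable and breaks the assembly wherever one appears.

```python
from collections import Counter


def unique_contigs(genome: str, k: int) -> list[str]:
    """Return the stretches of `genome` covered only by k-mers that occur once.

    Toy model: a k-mer seen more than once is treated as ambiguous,
    so the "assembly" breaks wherever such a k-mer appears.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)

    contigs = []
    start = None  # index of the first k-mer in the current unique run
    for i, kmer in enumerate(kmers):
        if counts[kmer] == 1:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at k-mer i - 1, which covers bases up to i + k - 2.
            contigs.append(genome[start:i + k - 1])
            start = None
    if start is not None:
        contigs.append(genome[start:])
    return contigs


# Hypothetical sequences: one repeat (think of an IS element) present twice.
left, middle, right = "ATGCCA", "TCTAGA", "CGATCG"
repeat = "GGTTAACC"
genome = left + repeat + middle + repeat + right

contigs = unique_contigs(genome, k=4)
for contig in contigs:
    print(contig)
```

With these inputs the single 34-base "genome" comes back as three pieces, and neither copy of the repeat is fully contained in any of them. Nothing in the pieces says which flank belongs next to which copy of the repeat, and that is exactly the information finishing has to supply.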
To return to Eisen's question, I think finishing microbial genomes is important if you really have to localize genes to plasmids (or circularizing prophage). In infectious disease, that's pretty important. However, from this perspective, finishing might become a moot point if the new technologies (454 pyrosequencing and Illumina) improve to the point where genes of interest can be reliably localized to plasmids*. Likewise, if you're interested in the biology of repetitive elements, you'll need finished genomes.
So, regarding finishing, I think in about a year, we'll have very little need for complete finishing, unless the biological question requires it (e.g., repetitive elements).
*To get technical, as long as I can link a gene to a plasmid scaffold--a set of smaller sequences that I know are tied together, even though I lack some of the intervening regions--I'm happy.
Good additional information here. I will post a link on my blog and in the FriendFeed discussion. A few comments:
1. Finishing does not have to require thinking, except for the up-front design of an automated system. I think with a little bit of up-front work we could in essence design a completely automated first-pass finishing system.
2. It has yet to be determined in publications what the raw quality of shotgun assemblies is with the various sequencing methods. We have some indications but we do not know enough. For example, how "random" is the output from 454 and Illumina? What are the biases?
3. Seems like there is strong support for finishing from an almost esthetic point of view. It feels better to be done. This is hard to use as a reason to spend $$$ on finishing but I do find it interesting.
4. One challenge we have not really dealt with is that the sequencing technology these days is changing much faster than it used to, and the assembly software has a hard time keeping up. So in some cases the software makes mistakes that can be fixed, but then the sequencing methods change enough that the fix is no longer that useful.
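On point 2, one rough way to put a number on how "random" shotgun output is: if read start positions were sampled uniformly at random, the number of reads per fixed-size window would be approximately Poisson, so the variance of the window counts divided by their mean would be close to 1. The Python sketch below computes that ratio. The function name `dispersion_index`, the window size, and the toy inputs are assumptions for illustration, not an established QC procedure or data from any real run.

```python
from statistics import mean, pvariance


def dispersion_index(read_starts, genome_length, window):
    """Variance-to-mean ratio of read counts per window.

    Roughly 1 for uniformly random (Poisson-like) sampling, well above 1
    when reads pile up in some regions, below 1 when they are spread
    more evenly than chance would produce.
    """
    n_windows = genome_length // window
    counts = [0] * n_windows
    for pos in read_starts:
        w = pos // window
        if w < n_windows:  # ignore reads in a trailing partial window
            counts[w] += 1
    avg = mean(counts)
    if avg == 0:
        raise ValueError("no reads fall inside any full window")
    return pvariance(counts) / avg


# Toy inputs on a hypothetical 1,000-base genome with 100-base windows.
evenly_spaced = list(range(0, 1000, 10))  # 10 reads in every window
piled_up = list(range(100))               # all 100 reads in the first window

even_score = dispersion_index(evenly_spaced, 1000, 100)
biased_score = dispersion_index(piled_up, 1000, 100)
print(even_score, biased_score)
```

The two toy inputs are the extremes, perfectly even coverage and everything stacked in one window. Real read positions would come from mapping reads to a reference, and a single ratio like this only flags that bias exists. It does not say where the bias is or what causes it.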
As an experimentalist, I may not get much use out of an unfinished genome if a region I'd like to know something about is unfinished. Mike's point about assigning a gene to a plasmid or the chromosome is one example, but there could be lots of others. If the genome sequencing is being done in part to support functional work, then we just might need to know what goes in the missing part of the genome.
Seems like there should be cheaper, experimental ways to determine whether a gene is on a plasmid or the chromosome than finishing [the same logic applies to other concerns/questions]. I think what we need to consider is how to answer the question directly rather than how to force approaches that don't really apply anymore onto said question. Then we need to work out how to get experimental data from other methodologies to feed back into the annotations and assemblies.