A couple of weeks ago I attended the Human Microbiome Research Conference. At that meeting, one talk by Bruce Birren (and covered by Jonathan Eisen) mentioned something that was completely overlooked by the attendees. Now, I don't blame them, since what Birren mentioned was about bacterial genomics, not the human microbiome. But here's what I tweeted about Birren's talk (TWEET!):
B. Birren-E. coli K-12 can be assembled into 1 scaffold for hundreds of $s with Illumina seq & new jumps
Let's unpack this below the fold.
When we sequence a genome, we actually sequence small pieces and then assemble them, like a jigsaw puzzle, into larger sections of contiguous sequence called contigs (as in contiguous...). With the Illumina technology, each read is about 180 bp* (one bp is one nucleotide pair), while an E. coli genome is about 5,000,000 bp. We can link these contigs using what are known as 'jumps': we sequence the two ends of a large piece of DNA of known size (Birren's colleagues were using 5 kb pieces). Because we know roughly how far apart those two ends are, this lets us join contigs into larger pieces, which have gaps of known size, called... scaffolds (clever term, no?). Scaffolding is how we deal with repeated elements--DNA sequences that are identical (or nearly so) and are larger than a single read:
Where the assemblers get hung up with bacteria is repeated elements--regions of the genome that are virtually identical (they don't have to be completely identical, just close enough that the assembler thinks they're identical reads with sequencing errors). Because the assembler can't figure out where to put these reads (they're all identical), it discards them--that's where the breaks occur.
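To make the repeat problem concrete, here is a toy sketch in Python. It is not how any particular assembler works: the made-up genome, the function names, and the use of k as a stand-in for read length are all assumptions for illustration, and it ignores reverse complements and sequencing errors. The idea is simply that if every window of sequence you can see is shorter than the repeat, the sequence at the end of the repeat can be followed by more than one thing, and you can't tell which.

```python
from collections import defaultdict

# Toy genome (made up): unique flank + REPEAT + unique middle + REPEAT + unique flank.
REPEAT = "TTAGGCAT"  # 8 bp, present twice
GENOME = "ACCGT" + REPEAT + "CGAAC" + REPEAT + "GTCCA"

def successors(genome, k):
    """Map each (k-1)-mer to the set of bases observed right after it."""
    nxt = defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        nxt[kmer[:-1]].add(kmer[-1])
    return nxt

def ambiguous_nodes(genome, k):
    """(k-1)-mers that can be extended in more than one way: candidate break points."""
    return sorted(p for p, bases in successors(genome, k).items() if len(bases) > 1)

# Window shorter than the repeat: the end of the repeat has two possible continuations.
print(ambiguous_nodes(GENOME, 5))   # ['GCAT']
# Window longer than the repeat: it reaches into unique flanking sequence, no ambiguity.
print(ambiguous_nodes(GENOME, 10))  # []
```

Only exits from the repeat are checked here; entries into it are ambiguous in the same way. The second call is the point of longer reads and of jumps: information that spans the whole repeat resolves it.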
This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements known as insertion sequence elements ('IS elements'). IS elements are one of the major reasons resistance genes move from plasmid to plasmid and from plasmid to chromosome (plasmids are mini-chromosomes that themselves can move from bacterium to bacterium). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know if it's found on a plasmid or on the chromosome--that's a pretty critical biological question. To further complicate things, different plasmids, as well as the bacterial chromosome, can carry the same IS elements. Not only will these introduce breaks into the assembly, but they can also lead to accidentally assembling plasmids together or incorrectly incorporating them into the genome.
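The 'gaps of known size' that jumps give us come down to simple arithmetic on where the two ends of a jumped fragment land. Here is a minimal sketch, assuming (unrealistically) that every fragment is exactly 5 kb and that both contigs are in the same orientation; the function name, the coordinates, and the contig length are hypothetical, and real libraries have a spread of fragment sizes.

```python
from statistics import median

INSERT_SIZE = 5000  # 5 kb jumps, as in the post; treated as exact here (an assumption)

def estimate_gap(len_a, links, insert_size=INSERT_SIZE):
    """Estimate the gap between contig A and contig B from jump pairs.

    len_a -- length of contig A in bp
    links -- (pos_a, pos_b) pairs: one end of the fragment starts at pos_a in A,
             the other end finishes at pos_b in B (both 0-based from contig start)
    """
    gaps = []
    for pos_a, pos_b in links:
        tail_of_a = len_a - pos_a  # part of the fragment lying inside contig A
        # fragment = tail of A + gap + head of B, so solve for the gap
        gaps.append(insert_size - tail_of_a - pos_b)
    return median(gaps)  # median is less sensitive to an odd mis-mapped pair

links = [(37000, 1500), (38500, 3000), (39000, 3600)]  # made-up coordinates
print(estimate_gap(40000, links))  # 500
```

If the jump pairs link a resistance-gene contig to a plasmid contig rather than a chromosomal one, that is exactly the 'positional' information we're after.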
Basically, for ~$600-$800, we can generate a really good bacterial genome. While it's not finished (that is, with all gaps sequenced), it is closed, and, as noted above, this is a critical advance. I think we'll know very shortly how well this new technology works with more difficult genomes, although I would add that, in my experience, some clinically relevant pathogens, such as S. aureus (including MRSA), have pretty straightforward genomes (they don't have a lot of repeated elements). Keep in mind, we're in an era where the actual sequencing is cheaper than everything else we need to do to sequence a genome.
But what is amazing is that we can generate clinically and epidemiologically relevant data** very rapidly. On a shiny new Illumina HiSeq, we're talking about hundreds of genomes in a week (assuming the other steps aren't rate-limiting...). And these are genomes which could give us really good 'positional' information (i.e., is the antibiotic resistance gene found on a plasmid, and thus able to move easily between bacteria).
Very exciting. This will completely blow open the field of microbial population genomics and make molecular epidemiology very, very powerful.
*The reads are actually 100 bp, but with some tricks we have them overlap to form a 'single' read of 180bp.
**If you're interested in the sequence of repeated elements, such as IS elements, you're outta luck. They're interesting, but, in my opinion, not critical for understanding the spread of resistance genes and resistant organisms.
Another option is physical mapping. I'm working with technologies that are currently too expensive (~$450) but map an entire chromosome at 1-3 kb resolution. I'm convinced that this very month the price can be reduced to $100 and that in 2 years it will be $50 or less and include large plasmids; or still be $100 and include even moderate-sized plasmids (25 kb or more, perhaps).
'Jump' sequencing will still not entirely solve the problem with 5-10 kb plasmids, and neither will this technology. However, plasmid profile gels can tell you when you have something like that in the way.
Cheers, interesting stuff. This is off topic maybe, but I'm fascinated by en masse sequencing of multiple microbial species at a time, from the gut etc. How do people go about stitching a genetic mixture together?
to clarify a small point - a typical short read assembler will not "discard them[reads from repeats]" but will recognize them based on higher coverage and place them accordingly without trying to guess gene order
you should get three contigs:
kmers/reads from the repeats do not belong exclusively to any one repeat but are shared equally
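The coverage point in this comment can be illustrated with a toy example. This is only a sketch under generous assumptions (a made-up genome, error-free reads starting at every position, a simple 1.5x-median cutoff, hypothetical function names), not what any real assembler does: k-mers from a two-copy repeat are seen about twice as often as k-mers from unique sequence, so they can be flagged rather than thrown away.

```python
from collections import Counter
from statistics import median

# Toy genome (made up): the 8 bp REPEAT occurs twice, everything else once.
REPEAT = "TTAGGCAT"
GENOME = "ACCGT" + REPEAT + "CGAAC" + REPEAT + "GTCCA"

def perfect_reads(genome, read_len):
    """One error-free read starting at every position (an idealisation)."""
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

def kmer_coverage(reads, k):
    """Count how many times each k-mer is seen across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def high_coverage_kmers(counts, factor=1.5):
    """k-mers seen at least `factor` times the median coverage: likely repeats."""
    typical = median(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n >= factor * typical)

counts = kmer_coverage(perfect_reads(GENOME, 8), 4)
print(high_coverage_kmers(counts))
# ['AGGC', 'GCAT', 'GGCA', 'TAGG', 'TTAG']  (all of them lie inside REPEAT)
```

Knowing which k-mers are repeats still doesn't say which copy connects to which flank; that is what the jumps are for.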