The Gene Sherpa predicts that Complete Genomics will win the Archon X Prize in Genomics in 2010. In the comments, Keith Robison is wisely skeptical. I agree with Keith - it's unlikely that the X Prize will be won this year, and if it is, the winner is unlikely to be Complete Genomics.
For those who don't know the prize, here's the brief summary: the X Prize Foundation will give US$10 million to the first team to satisfy the following conditions:
- sequence 100 human genomes in 10 days or less, with
- an accuracy of no more than one error in every 100,000 bases sequenced [note that the stated error rate on this page is mistakenly quoted as one in 10,000 bases], with
- sequences accurately covering at least 98% of the genome, and
- at a recurring cost of no more than $10,000 per genome.
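For the programmatically inclined, the four conditions can be sketched as a simple checklist. This is only an illustration: the `Attempt` fields and the `failed_conditions` function are names invented here, the example numbers are made up, and the Prize's actual guidelines define errors and coverage far more carefully than this.

```python
from dataclasses import dataclass
from typing import List

# Thresholds as summarised in the list above.
MIN_GENOMES = 100
MAX_DAYS = 10
BASES_PER_ALLOWED_ERROR = 100_000  # no more than one error per 100,000 bases
MIN_COVERAGE = 0.98
MAX_COST_PER_GENOME_USD = 10_000


@dataclass
class Attempt:
    """Hypothetical summary of one team's attempt (all fields are assumptions)."""
    genomes: int
    days: float
    errors: int
    bases_sequenced: int
    coverage: float              # fraction of the genome accurately covered
    cost_per_genome_usd: float


def failed_conditions(a: Attempt) -> List[str]:
    """Return the names of the conditions the attempt fails (empty if none)."""
    failures = []
    if a.genomes < MIN_GENOMES:
        failures.append("genomes")
    if a.days > MAX_DAYS:
        failures.append("days")
    # Integer comparison avoids floating-point rounding in the error rate.
    if a.errors * BASES_PER_ALLOWED_ERROR > a.bases_sequenced:
        failures.append("accuracy")
    if a.coverage < MIN_COVERAGE:
        failures.append("coverage")
    if a.cost_per_genome_usd > MAX_COST_PER_GENOME_USD:
        failures.append("cost")
    return failures


# Made-up numbers: fast, cheap and accurate enough, but only 95% coverage.
example = Attempt(genomes=100, days=10, errors=30_000,
                  bases_sequenced=3_000_000_000, coverage=0.95,
                  cost_per_genome_usd=8_000)
print(failed_conditions(example))  # ['coverage']
```

The point of listing failures rather than returning a single yes/no is that a team has to clear every bar at once: being cheap and fast does not help if coverage or accuracy falls short.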
Sequencing technology is developing fast, but it seems unlikely that these conditions will be successfully met in 2010. Here's why:
Firstly, accurate coverage of 98% of the genome is still a challenge for the current crop of short-read sequencers (such as Complete Genomics' platform). Much of the human genome is highly repetitive, making it extremely difficult to place a short sequence read uniquely in its correct location in the genome. In its most recent publication, Complete Genomics quoted simulation data suggesting that the maximum possible coverage with its current technology is 98%, and in reality it achieved coverage of between 86 and 95% of the genome. That will certainly improve as the technology moves forward, but it will be seriously challenging to reach 98% - and that's not even counting the non-trivial fraction of the genome that is too repetitive to be included in the reference genome.
Secondly, and more importantly, the required error rates are much too stringent to be achieved by current short-read technologies. Complete Genomics can just about meet the one in 100,000 requirement when it comes to single base variants (SNPs), but the error rate is substantially higher than this both for small insertions and deletions and (crucially) for the large-scale rearrangements known as structural variants, which involve the insertion or deletion of over 1,000 bases of material.
Having seen first-hand the challenges of calling insertion/deletion and structural variants from short-read sequence data, I'm pretty skeptical about the probability that the error rate for these variants can be reduced to one in every 100,000 bases. Even long-read technologies such as the Pacific Biosciences platform will struggle to call these accurately enough to meet the Prize's requirements.
This is not to downplay Complete Genomics' achievements: I've been seriously impressed with what the company has achieved since it released its first human sequencing data last February. I've also watched the attitude of the genomics community shift from hostility, through curiosity, to genuine interest; I suspect we'll see some non-trivial outsourcing of genome sequencing to the company even by large genome facilities during 2010. The company could certainly meet the cost requirements of the Prize, especially once its new Californian facility is up and running smoothly (some time early this year); nonetheless, I think the other conditions are beyond the likely capabilities of the Complete system in 2010.
When will the Prize be won? Keith Robison predicts that a win is at least two years away; I think he might be right, although I'd give 50-50 odds of a successful attempt in 2011 (but reserve the right to modify that prediction based on technology developments this year!).
Anyway, we'll know very soon if an attempt is likely in 2010: to qualify for the prize in a given year, a team must have registered for the prize by January 15th of that year at the latest. At this stage the list of registrants includes only quite unlikely candidates (e.g. 454, whose technology is too low-throughput, and the totally unproven ZS Genetics), so unless Illumina, Life Technologies, Complete Genomics, Pacific Biosciences or Oxford Nanopore registers within the next 11 days it's extraordinarily unlikely that we'll see a win this year.
I'll keep you posted.
Daniel,
In addition to net coverage and accuracy, perhaps the most onerous X PRIZE competition stipulation is the sequencing of DIPLOID genomes (6 billion bp) and "complete genotyping of each chromosome."
That puts a premium on read length and haplotyping -- Complete may get there with its "long fragment read" approach -- but unfortunately suggests The Prize won't be claimed for some time. Damn!
Kevin
Hi Kevin,
Ah - I wasn't sure if that wording necessarily required that the chromosomes be completely phased, or whether it would be sufficient to just have an accurate (diploid) genotype call at every individual base along the genome. I now see the competition guidelines state: "A rearrangement or haplotype error counts as one error" - so it would seem you're right that a completely phased diploid genome is required.
In that case you're absolutely right: there's no technology around that could meet these requirements, and I'm pretty dubious we'll see one even close by 2011. Complete's LFR approach is elegant, but it won't provide sufficiently good haplotyping to produce <1 switch error per 100,000 bases.
Kevin's right, I'm afraid. But maybe we should start a pool?
If we ignore the cost and time limit, is current technology able to deliver the other requirements? If so, at what cost and time frame?
No - as described in detail in the post, current technology cannot meet the coverage and accuracy requirements.