For those who don't know the prize, here's the brief summary: the X Prize Foundation will give US$10 million to the first team to satisfy the following conditions:
- sequence 100 human genomes in 10 days or less, with
- an accuracy of no more than one error in every 100,000 bases sequenced [note that the stated error rate on this page is mistakenly quoted as one in 10,000 bases], with
- sequences accurately covering at least 98% of the genome, and
- at a recurring cost of no more than $10,000 per genome.
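To make the four conditions concrete, here is a minimal Python sketch that encodes them as a simple check. The thresholds are the ones listed above; the `Attempt` class, the function names and the genome size of roughly 3 billion bases are my own assumptions for illustration, not anything taken from the Prize rules.

```python
from dataclasses import dataclass
from typing import List

# Thresholds as summarised in the list above.
MIN_GENOMES = 100
MAX_DAYS = 10
MAX_ERRORS_PER_100K_BASES = 1
MIN_COVERAGE_PERCENT = 98
MAX_COST_PER_GENOME_USD = 10_000

# Assumption (not from the post): a human genome of roughly 3 billion bases.
ASSUMED_GENOME_BASES = 3_000_000_000


@dataclass
class Attempt:
    """Hypothetical summary of one team's attempt at the prize."""
    genomes: int
    days: float
    errors_per_100k_bases: float
    coverage_percent: float
    cost_per_genome_usd: float


def failed_conditions(attempt: Attempt) -> List[str]:
    """Return the names of the conditions the attempt does not meet."""
    failures = []
    if attempt.genomes < MIN_GENOMES or attempt.days > MAX_DAYS:
        failures.append("throughput")
    if attempt.errors_per_100k_bases > MAX_ERRORS_PER_100K_BASES:
        failures.append("accuracy")
    if attempt.coverage_percent < MIN_COVERAGE_PERCENT:
        failures.append("coverage")
    if attempt.cost_per_genome_usd > MAX_COST_PER_GENOME_USD:
        failures.append("cost")
    return failures


def max_errors_per_genome(genome_bases: int = ASSUMED_GENOME_BASES) -> int:
    """Errors permitted across one genome at one error per 100,000 bases."""
    return genome_bases * MAX_ERRORS_PER_100K_BASES // 100_000
```

Under the assumed genome size, `max_errors_per_genome()` works out to 30,000 permitted errors per genome, which gives a sense of how little room the accuracy condition leaves once every class of variant is counted.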
Sequencing technology is developing fast, but it seems unlikely that these conditions will be successfully met in 2010. Here's why:
Firstly, full coverage of 98% of the genome is still a challenge for the current crop of short-read sequencers (such as Complete Genomics). Much of the human genome is highly repetitive, making it extremely difficult to place a short sequence read uniquely in its correct location in the genome. In its most recent publication, Complete Genomics quoted simulation data suggesting the maximum possible coverage with its current technology is 98%, and in reality achieved coverage of between 86 and 95% of the genome. That will certainly improve as the technology moves forward, but it will be seriously challenging to reach 98% - and that's not even counting the non-trivial fraction of the genome that is too repetitive to be included in the reference genome.
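For a rough sense of scale, the gap between the reported coverage and the 98% threshold can be turned into a count of bases. This is a back-of-the-envelope sketch only: the reference genome size of about 3 billion bases is my assumption, and `shortfall_bases` is a hypothetical helper.

```python
# Assumption (not from the post): a reference genome of roughly 3 billion bases.
ASSUMED_GENOME_BASES = 3_000_000_000
REQUIRED_COVERAGE_PERCENT = 98  # the Prize's coverage threshold


def shortfall_bases(achieved_percent: int,
                    genome_bases: int = ASSUMED_GENOME_BASES) -> int:
    """Bases still to be covered to reach the threshold (0 if already met)."""
    gap_percent = max(REQUIRED_COVERAGE_PERCENT - achieved_percent, 0)
    return genome_bases * gap_percent // 100


# The two ends of the coverage range quoted above.
for achieved in (86, 95):
    print(f"{achieved}% coverage leaves {shortfall_bases(achieved):,} bases to go")
```

On that assumption, the shortfall runs from about 90 million bases at 95% coverage to about 360 million bases at 86%.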
Secondly, and more importantly, the required error rates are much too stringent to be achieved by current short-read technologies. Complete Genomics can just about meet the one in 100,000 requirement when it comes to single base variants (SNPs), but the error rate is substantially higher than this both for small insertions and deletions and (crucially) for the large-scale rearrangements known as structural variants, which involve the insertion or deletion of over 1,000 bases of material.
Having seen first-hand the challenges of calling insertion/deletion and structural variants from short-read sequence data, I'm pretty skeptical about the probability that the error rate for these variants can be reduced to one in every 100,000 bases. Even long-read technologies such as the Pacific Biosciences platform will struggle to call these accurately enough to meet the Prize's requirements.
This is not to downplay Complete Genomics' achievements: I've been seriously impressed with what the company has achieved since it released its first human sequencing data last February. I've also watched the attitude of the genomics community shift from hostility, through curiosity, to genuine interest; I suspect we'll see some non-trivial outsourcing of genome sequencing to the company even by large genome facilities during 2010. The company could certainly meet the cost requirements of the Prize, especially once its new Californian facility is up and running smoothly (some time early this year); nonetheless, I think the other conditions are beyond the likely capabilities of the Complete system in 2010.
When will the Prize be won? Keith Robison predicts that a win is at least two years away; I think he might be right, although I'd give 50-50 odds of a successful attempt in 2011 (but reserve the right to modify that prediction based on technology developments this year!).
Anyway, we'll know very soon if an attempt is likely in 2010: to qualify for the prize in a given year, a team must have registered for the prize by January 15th of that year at the latest. At this stage the list of registrants includes only quite unlikely candidates (e.g. 454, whose technology is too low-throughput, and the totally unproven ZS Genetics), so unless Illumina, Life Technologies, Complete Genomics, Pacific Biosciences or Oxford Nanopore registers within the next 11 days it's extraordinarily unlikely that we'll see a win this year.
I'll keep you posted.