I discussed the second-generation sequencing company Complete Genomics a couple of weeks ago (see here and here). These guys are unique in that they offer their technology only as a service, rather than the usual business model of selling platforms to genomics facilities, and a highly restricted service at that: Complete has stated fairly categorically that it will only be sequencing human genomes (no plants, algae, or even chimpanzees!).
Whether this business model will prove a commercial success remains to be seen, but the company seems to have impressed the genomics community with its early-release data. Today the company officially announced a partnership with the Broad Institute to sequence five complete genomes. This isn't exactly breaking news (the collaboration was openly discussed at AGBT a few weeks ago), but these details are new:
Complete Genomics will use its proprietary DNA sequencing technology to sequence five genomes from samples provided by the Broad Institute. The first genome sequenced will be a test case that has already been studied extensively by the scientific community. The other four genomes are tumor and matched-pair normals; one pair will be used to study glioblastoma and the other melanoma.
Presumably the first sample will be one of the anonymous HapMap DNA samples that are currently being sequenced as part of the 1000 Genomes Project,
giving the researchers at the Broad a solid baseline to determine the
accuracy of the Complete Genomics platform. The other four genomes will
then be from two cancer patients (one sample from the tumour and one
from normal tissue in each case) to study the genomic anarchy that underlies cancer
formation, as part of a vastly larger series of studies of this type.
The price isn't stated in the press release, but Complete confirmed to me
in an email a while back that these early pilot projects will cost
$100,000 per 5 genomes. That price will drop rapidly as Complete scales
up its sequencing facility: at its commercial launch in June 2009,
the company plans to release a formal pricing scheme that, according
to the email, will support a substantially lower per-genome price.
Update on error rates
In my previous post
I speculated about the possible number of errors that might pop up over
a whole genome sequence using Complete's platform. The issue here is
that the genome is very large, so even quite a low error rate can
result in a high number of "noise" variants, potentially obscuring the
signal from real genetic variations.
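To illustrate the scale of the problem, here's a back-of-the-envelope sketch; the genome size and per-individual SNP count are my own ballpark figures, not Complete's:

```python
# Back-of-the-envelope: how a tiny per-base error rate swamps real variation.
GENOME_SIZE = 3_000_000_000  # haploid human genome, roughly 3 billion bases
REAL_SNPS = 3_500_000        # rough number of true SNPs per individual

def expected_false_calls(per_base_error_rate):
    """Expected number of spurious variant calls across the whole genome."""
    return GENOME_SIZE * per_base_error_rate

for rate in (1e-5, 1e-4, 1e-3):
    noise = expected_false_calls(rate)
    print(f"error rate {rate:.0e}: ~{noise:,.0f} false calls "
          f"({noise / REAL_SNPS:.1%} of the real SNP count)")
```

Even an error rate of one in 100,000 bases, excellent by sequencing standards, generates tens of thousands of spurious calls over a whole genome.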
I recently spoke over the
phone to Geoff Nilsen, a senior bioinformatician at Complete, about the
company's own estimates of the number of errors they expect to see in a
complete genome sequence. Nilsen emphasised that these estimates are
still very rough, being based on experimental validation of a
relatively small number of variable sites from their pilot genome, and that the company has
further software and process development underway to characterise and
reduce their error rates.
Still, based on preliminary quality control data, Nilsen very cautiously estimated that they had somewhere in the vicinity of 80,000-100,000 false positive calls, and perhaps around 1,000 false negatives,
for single nucleotide polymorphisms in their pilot genome sequence. I
emphasise again that these are estimates with very large error bars -
for instance, the 95% confidence interval for false positives is 78,000 +/- 236,000.
The data are even more preliminary for insertion
and deletion variants, which lack a clean reference data-set (for the
single base variants above Complete was able to rely on data from the
HapMap project). At this stage Complete has validated a set of 57
homozygous insertion/deletion variants, all of which were called
accurately by their platform, but this is far too small a data-set to
be extrapolating to genome-wide error rates.
These false positive and false negative error rates will of course be
absolutely crucial for many applications of whole-genome sequencing -
for instance, clinicians interested in finding the single mutation that
causes a severe congenital disease will not want to sift through huge
numbers of false variants, or to find that their one target mutation
was missed by the base-calling algorithm.
Nilsen told me that at this stage they would not expect clinicians to apply their sequencing service in this way to individual patients;
they expect the first clinical applications to come from sequencing
studies of much larger numbers of patients and appropriately matched
controls. This is perfectly reasonable caution - none of the current
whole-genome sequencing technologies is "clinic-ready" in the sense of
providing a highly accurate single-patient diagnostic test, and of
course the "noise" from normal genetic variation between individuals
will also make it important to use large numbers of patients to hunt
down disease-related mutations even in the absence of sequencing error.
For non-clinical applications (e.g. population genetics) these
types of error rates seem more than acceptable; researchers are used to
dealing with much noisier data than this. If Complete can indeed offer
a full genome sequence at this quality for $5000 I suspect they will be
receiving plenty of interest from geneticists interested in normal
human variation as well as disease.
How Complete's estimated
error rates hold up in the real world - particularly for
customer-submitted samples of varying quality - remains to be seen, but
we should soon get a better idea from the results of collaborations
between Complete and large genomics facilities like the Broad.
Comments
How do they calculate false positives/negatives for a SNP with two common alleles? Isn't some kind of reproducibility stat more appropriate?
I wish I was born 20 years later. It's the most exciting time in the whole history of biology!
well you are still alive to enjoy the exciting biology!
I thought Complete was only going to do human genomes. I'm wondering how different the tumor genomes are and what problems this might give for their platform.
How stochastic are the errors across runs? Could you follow up a possible clinical false positive with another several runs and expect to exponentially reduce errors?
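The intuition behind this question can be sketched roughly as follows, under the (optimistic) assumption that errors at a given site are independent between runs; in practice systematic errors would not shrink this way:

```python
# If errors at a site were truly independent across runs, requiring all
# n runs to agree on a variant call would shrink the false-call
# probability roughly as p ** n. Systematic (non-random) errors would
# recur in every run and not be reduced at all.
def residual_error(p, n_runs):
    """Probability a site is miscalled in all of n independent runs."""
    return p ** n_runs

print(residual_error(1e-4, 1))  # one run: the raw error rate
print(residual_error(1e-4, 3))  # three concordant runs: vastly smaller
```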
I've been waiting for this. These guys are incredible!!! Just the fact that they have been able to sequence multiple genomes of the highest quality this year is mind-boggling!
In response to the question from "Anonymous": We used the genotype information from the HapMap project to calculate the false positive/false negative rates, and used our Sanger sequencing results for that as well. The HapMap genotype data was gathered with a different platform for this same individual, allowing us to do the comparison. In those cases where we differed from the HapMap data, we did a small set of follow-up studies with a third platform (good ol' Sanger sequencing) to calculate the error rates.
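The comparison described in this reply can be sketched roughly like this; the data structures and genotypes below are purely illustrative, not Complete's actual pipeline:

```python
def concordance_stats(calls, hapmap):
    """Compare platform SNP calls against HapMap genotypes at shared sites.

    calls / hapmap: dicts mapping a site identifier to a genotype string
    (e.g. 'AG'). A real pipeline would also normalise allele order
    ('AG' vs 'GA') and strand before comparing. Discordant sites are
    candidates for follow-up validation (e.g. by Sanger sequencing)
    before being counted as true errors.
    """
    shared = calls.keys() & hapmap.keys()
    concordant = sum(1 for s in shared if calls[s] == hapmap[s])
    discordant = sorted(s for s in shared if calls[s] != hapmap[s])
    return concordant, discordant

# Toy example: site 2 disagrees; site 4 has no HapMap genotype, so it
# is simply excluded from the comparison.
calls  = {1: "AG", 2: "CC", 3: "TT", 4: "AT"}
hapmap = {1: "AG", 2: "CT", 3: "TT"}
conc, disc = concordance_stats(calls, hapmap)
print(conc, disc)
```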