A paper just published online in Nature Genetics describes a brute-force approach to finding the genes underlying serious diseases in cases where traditional methods fall flat. While somewhat successful, the study also illustrates the paradoxical challenge of working with large-scale sequencing data: there are often too many possible disease variants, and it can be extremely difficult to work out which are actually causing the disease in question.
The authors looked at 208 families where multiple members suffered from mental retardation and where the family history was consistent with the underlying gene being carried on the X chromosome. In most cases the families weren't large enough to use linkage analysis to narrow down the location of the gene - in other words, the disease-causing mutation could be almost anywhere among the more than 800 genes scattered along this chromosome.
In these cases the traditional approaches of genetics break down - apart from screening the known genes involved in mental retardation and hoping for a lucky break, there's little that can be done to find the gene responsible. The researchers thus took advantage of automated large-scale DNA sequencing to simply analyse the protein-coding regions of nearly every gene on the X chromosome.
That's a total of one million DNA bases per patient - a
particularly impressive figure given it was generated using traditional
Sanger sequencing rather than one of the massively high-throughput
second-generation sequencing platforms now available.
The researchers found many genetic variants that would be expected to disrupt gene function: almost 1000 changed the predicted protein encoded by a gene, 22 introduced unusual "stop" signals, 15 changed the reading frame and 13 were found in strongly evolutionarily conserved regions associated with RNA processing.
Of the 42 variants most likely to cause disease (so-called "truncating"
variants) 38 were found in only one family, and these tended to cluster
together in specific genes - for instance, one gene contained 5
different rare, damaging mutations. However, many of these variants were found in both patients and their healthy male siblings,
suggesting that they are not causative in mental retardation. These
genes could represent subtle predisposing factors for mental
retardation, but it's likely that most of them are simply genes that
can be inactivated with little or no deleterious consequences for
Overall, only nine genes showed strong evidence for
disease-causing mutations. The researchers went on to sequence these
genes in a further 914 mental retardation patients and over a thousand
controls, but found only a handful of likely disease-causing mutations
in these genes in other patients.
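The sibling check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the authors' actual pipeline: the `Variant` class, the gene names and the family-member IDs are all invented for the example. The idea is that a male has a single X chromosome, so a truncating variant carried by a healthy brother is unlikely to be the cause of the disease in his affected sibling.

```python
from dataclasses import dataclass, field

# Variant classes the post calls "truncating" (stop signals and frameshifts).
TRUNCATING = {"nonsense", "frameshift"}


@dataclass
class Variant:
    """Hypothetical record for one variant seen in one family."""
    gene: str
    kind: str
    affected_carriers: set = field(default_factory=set)
    healthy_male_carriers: set = field(default_factory=set)


def segregates_with_disease(variant):
    """True if at least one patient carries the variant and no healthy male does."""
    return bool(variant.affected_carriers) and not variant.healthy_male_carriers


def candidate_genes(variants):
    """Group truncating variants that pass the sibling filter by gene."""
    by_gene = {}
    for v in variants:
        if v.kind in TRUNCATING and segregates_with_disease(v):
            by_gene.setdefault(v.gene, []).append(v)
    return by_gene


# Invented example data: placeholder gene names and family-member IDs.
variants = [
    Variant("GENE_A", "nonsense", {"fam1_patient1", "fam1_patient2"}),
    Variant("GENE_B", "frameshift", {"fam2_patient1"}, {"fam2_brother"}),
    Variant("GENE_C", "missense", {"fam3_patient1"}),
]

candidates = candidate_genes(variants)
print(sorted(candidates))
```

Passing this filter only means a variant has not been ruled out; it is not evidence of causation on its own. The missense variant is dropped here simply because the sketch restricts itself to truncating changes, which is exactly where the almost 1000 protein-altering variants become hard to interpret.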
Although the technical
achievement is impressive, the picture from this survey is somewhat
depressing (although not really surprising) for researchers interested
in using large-scale sequencing to discover disease-causing variants.
It's a clear demonstration that even examining the majority of protein-coding sequence will be insufficient to capture most of our nasty genetic secrets
- many of these lurk deep in non-coding DNA, while a fair chunk of the
remainder simply hide in the biological noise resulting from all of the
other non-disease-causing variants in the genome. In this study it's
likely that the researchers have actually uncovered a fair number of
disease-causing mutations (for instance, among the almost 1000
protein-altering variants) but are currently simply unable to
distinguish them from benign polymorphisms.
What's the solution? More sequencing,
for a start - digging deep into the non-coding portions of the genome,
and also ensuring very accurate coverage of the protein-coding portions
(in this study an average of just 75% of the targeted regions were
actually successfully sequenced in any given individual). This is
already entirely feasible due to the emergence of second-generation
sequencing, and will become rapidly more affordable as sequencing costs
drop. Already there are research groups around the world planning
massive sequencing studies to identify rare mutations underlying severe
But sequencing won't be enough: we need much better methods for sifting out the truly function-altering genetic variants from the biological noise.
This is already difficult enough for protein-coding regions (as this
study demonstrates); we currently have virtually no way of picking out
disease-causing variants in the remaining 98% of the genome. There's a
clear need for developing highly accurate and comprehensive maps of the
functional importance of each and every base in the human genome,
using all of the tools at our disposal - something that will keep us
geneticists busy long after we've run out of genomes to sequence.
"A program developed by Cornell researchers deduced the natural laws without a shred of knowledge about physics or geometry. The research is being heralded as a potential breakthrough for science in the Petabyte Age, where computers try to find regularities in massive datasets that are too big and complex for the human mind."
Programs like these are in their infancy but they are enjoying some successes. It isn't just genetics that is drowning in data.
Did you read about the Incidentalome?
Noise is the big problem here, and it will persist for decades.