So if you haven't already read those articles, go and do so now - when you come back, I want to talk about the potentially worrying implications of this paper for the future of personal genomics.
There are really only two pieces of jargon you need to know to follow this story, and those are the two classes of genetic variants that alter the expression levels of genes: cis and trans variants. To put it simply, cis variants are those that are found close to a gene, and trans variants are those that act on a gene's expression levels but are found far away in the genome (typically on another chromosome).
The ominous message from this paper is this: by examining the effects of genetic ancestry on local gene expression levels in samples from African-Americans, the study provides the first solid estimate of the proportion of the variance in gene expression that is determined by cis-acting variants - and suggests that this proportion is low (around 12%). In other words, this paper suggests that ~88% of the heritable variance in a gene's expression comes from trans variants found far from that gene.
That's worrying for two reasons. Firstly, it suggests that it may be even more difficult than expected to untangle the molecular basis of the signals found in recent genome-wide association studies for common diseases. Many of these signals fall outside known genes, and the default hypothesis is that the causative variants underlying these signals somehow affect the expression of nearby genes. If in fact many of these variants exert their effects through regulation of distant genes, it will be much more difficult to nail down the pathways involved, particularly for diseases where it is hard to get samples of the affected tissues from living individuals (e.g. psychiatric diseases).
But the most troubling implications of this finding are for the future of personal genomics; bear with me for a moment, because this will take a little background to explain.
We already knew from the relatively poor yield of genome-wide association studies (which target common variants) that a substantial fraction of the genetic risk of common diseases - especially at an individual level - is likely the result of rare genetic variants of moderate effect. Such variants are completely invisible to the chip-based approach of current personal genomics companies, but they will be detected by rapid, affordable whole-genome sequencing methods that will almost certainly be available within the next five years. The problem is that any individual genome will contain many rare variants, only a fraction of which are actually disease-causing - so the major challenge facing personal genomics right now is developing methods for inferring the functional effects of novel sequence variants.
As I noted above, another of the lessons from recent genome-wide association studies is that many disease-associated variants fall outside protein-coding genes, and are thus likely to increase risk by disrupting patterns of gene expression (rather than altering protein sequences). This is a problem, because our current understanding of the way DNA regulates gene expression is still in its infancy, and we can make only the crudest guesses regarding whether or not a new-found variant will alter gene expression and, if so, in what direction - and that's even for variants that are found close to a gene. Making de novo predictions about the effect of a novel sequence variant on distant genes will be immensely more challenging - but if this study is to be believed, that's exactly what will need to be done for the majority of expression-altering variants.
This is all rather ironic given that barely a month ago I was all cheerful about a recent paper showing that cis variants tend to cluster tightly around transcriptional start and end positions, making it much easier to nail down expression-altering variants found close to a gene. Now it seems that this will only apply to a small fraction of the overall bulk of expression-altering variants - a rather depressing revelation.
There are still caveats about the findings of this paper that may provide a glimmer of hope. The authors note one issue at the end of the discussion: the gene expression data are derived from only one tissue (white blood cells), and it will be important to extend this analysis to other tissues involved in common disease (such as pancreatic cells in diabetes). However, while the regulatory variants will differ from tissue to tissue, I'd be surprised if the big picture (in terms of the proportion of cis and trans variants) was strikingly different - unless anyone can think of reasons why proximal regulatory elements would systematically alter in importance from tissue to tissue?
A more interesting caveat is that this study essentially looked at the effect of between-population genetic variation on gene expression, by using the relative proportion of African and European ancestry within each region of the genome in admixed individuals, and it's unclear to me whether the same effect will necessarily be seen for within-population variation. In the comments over at Gene Expression, G from Popgen ramblings notes that the authors made an effort to address this issue (by looking at the population differentiation of cis and trans variants) and found no striking difference, but it's still possible that there are a few highly population-specific trans variants acting on many genes that explain a large chunk of the variance. If that's the case, the fraction of the variation explained by cis variants may turn out to be substantially higher in within-population data - but I'll admit that it's a long shot.
If this picture of the distribution of cis and trans effects is accurate, then the only way to accurately predict the effects of novel expression-altering variants will be by assembling a genome-wide map of the regions affecting the expression of each disease-relevant gene in each disease-relevant tissue. That's a feat that will require some very clever biology, and will take far longer than five years - so once again, you will have your genome sequenced long before anyone can tell you what it means.
Citation: Alkes L. Price, Nick Patterson, Dustin C. Hancks, Simon Myers, David Reich, Vivian G. Cheung, Richard S. Spielman (2008). Effects of cis and trans Genetic Ancestry on Gene Expression in African Americans. PLoS Genetics, 4 (12). DOI: 10.1371/journal.pgen.1000294
Such variants are completely invisible to the chip-based approach of current personal genomics companies, but they will be detected by rapid, affordable whole-genome sequencing methods that will almost certainly be available within the next five years. The problem is that any individual genome will contain many rare variants, only a fraction of which are actually disease-causing - so the major challenge facing personal genomics right now is developing methods for inferring the functional effects of novel sequence variants.
I just want to make sure I understand what you're saying: we'll be able to genotype rare variants within 5 years, but we won't know their effects on disease risk or quantitative traits for significantly longer?
Warning: Long answer.
Well, we can already genotype rare variants (if we know where they are), and we can find them in targeted areas of the genome (through standard sequencing). What will be new in five years is the affordable ascertainment of pretty much every rare variant anywhere in the genome, through rapid whole-genome sequencing.
But yes, the problem will be inferring the effects of these variants on traits. For "somewhat rare" variants - those with a frequency of, say, 0.5-5%, the problem is easier: we'll pick most of them up from the 1000 Genomes Project, they'll be included on the next generation of SNP chips, and case-control studies with extremely large samples (tens to hundreds of thousands of participants) will have the power to detect their effect on disease risk. That will likely take care of an appreciable fraction of the population disease risk; these variants may well be clinically useful for stratifying people into low and high risk categories for targeted disease screening, at least for some traits.
But for real personal genomics we want to be able to get at the "very rare" variants - those with a frequency less than 0.5% or so. These are too rare to be detected by genome-wide association studies for the foreseeable future; but there's growing evidence that this frequency category is highly enriched for moderate- to large-effect disease risk variants (due to the effects of negative selection, such variants rarely reach high frequency; common risk variants typically have tiny effect sizes). So personal genomics customers will want to know if they have one of these nasty variants. The problem is how we distinguish them from the thousands of other "very rare" variants that each person carries that don't increase disease risk.
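To make the rough frequency bins used in this answer concrete, here is a minimal Python sketch. The function name `frequency_class` is hypothetical, and the cut-offs (above 5% for "common", 0.5-5% for "somewhat rare", below 0.5% for "very rare") are just the approximate figures quoted above, not a fixed convention; how the exact boundary values are assigned is an arbitrary choice made here.

```python
def frequency_class(freq: float) -> str:
    """Bin an allele frequency (as a fraction, 0 to 1) into rough classes.

    The thresholds are the approximate ones from the discussion above;
    they are illustrative, not a field-wide standard.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if freq > 0.05:
        return "common"         # the range targeted by current SNP chips
    if freq >= 0.005:
        return "somewhat rare"  # 0.5-5%: reachable with very large case-control studies
    return "very rare"          # below 0.5%: likely to need de novo functional prediction


if __name__ == "__main__":
    for f in (0.20, 0.01, 0.0002):
        print(f, frequency_class(f))
```

The point of the binning is practical rather than biological: each class calls for a different discovery strategy, and only the last one depends entirely on predicting function from sequence.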
In the absence of association study data, the only possible solution is de novo functional prediction - but to make perfectly accurate predictions of disease risk for all possible rare variants, we would need to know (1) what every region of the genome does; (2) how the functions of each region correspond to disease risk; and (3) exactly how all possible variants within a region could disrupt its function. Essentially, we would have to understand precisely how the human machine operates at a molecular level.
That's still a long way off; in the meantime, we can assign hazy probabilistic estimates of disease risk for rare variants by multiplying two uncertain estimates: (1) the probability that a variant disrupts a gene by (2) the probability that disrupting that gene increases disease risk. We can roughly estimate (2) by looking at the distribution of disease-associated common variants, and by incorporating information from functional networks. Estimating (1) is harder - and it gets harder still if most of the variants that alter gene expression are found far away from the gene itself, as this paper suggests.
Despite the uncertainty, I expect personal genomics companies will employ something like this algorithm to produce an overall disease risk estimate by summing these hazy probabilities over all of the novel variants in an individual's genome, added to the easier-to-calculate risk estimates derived from common variants. Early versions of this algorithm will be extremely unreliable; but as more biological information is fed into them they will become increasingly accurate. I like to imagine it as a probabilistic disease risk cloud around each person, slowly sharpening over the next few decades to eventually converge on a fairly reliable estimate of their true risk of each disease.
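As a rough illustration of the bookkeeping described above (multiply the two uncertain estimates for each novel variant, sum over variants, then add the common-variant contribution), here is a minimal Python sketch. All names (`NovelVariant`, `variant_risk_score`, `overall_risk_score`) and the numbers in the usage example are hypothetical, and the result should be read as an uncalibrated score rather than a true probability of disease; any real implementation would need far more careful statistics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NovelVariant:
    name: str
    p_disrupts_gene: float      # estimate (1): chance the variant disrupts a gene
    p_gene_affects_risk: float  # estimate (2): chance that disrupting the gene raises risk


def variant_risk_score(v: NovelVariant) -> float:
    """Hazy per-variant score: the product of the two uncertain estimates."""
    return v.p_disrupts_gene * v.p_gene_affects_risk


def overall_risk_score(novel: Iterable[NovelVariant],
                       common_variant_score: float = 0.0) -> float:
    """Sum the per-variant scores and add the common-variant contribution."""
    return common_variant_score + sum(variant_risk_score(v) for v in novel)


if __name__ == "__main__":
    # Made-up illustrative values, not real data.
    genome = [
        NovelVariant("variant_A", p_disrupts_gene=0.5, p_gene_affects_risk=0.2),
        NovelVariant("variant_B", p_disrupts_gene=0.1, p_gene_affects_risk=0.5),
    ]
    print(overall_risk_score(genome, common_variant_score=0.3))
```

In this framing, better biological knowledge shows up as tighter values for `p_disrupts_gene` and `p_gene_affects_risk`, which is what would make the "risk cloud" sharpen over time.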
We already knew from the relatively poor yield of genome-wide association studies (which target common variants) that a substantial fraction of the genetic risk of common diseases - especially at an individual level - is likely the result of rare genetic variants of moderate effect.
There is also the possibility that many common variants contribute very small effects that cumulatively explain some further portion of the genetic risk. Less common things will, of course, also explain some fraction.
On a semantic note: are you using "rare" to mean private to families, or things that exist in the general population at a (very) low frequency? Very different animals!
Sorry, I completely lost track of this thread.
Re: common vs rare variants, I completely agree that common variants of small effect (and other sources of heritable variation e.g. epigenetic variants) will also make up a portion of the remaining variance - that's why I said that rare variants probably contribute a "substantial fraction", not the entirety.
As for the definition of "rare" - the convention on this isn't fixed yet, but I usually use "rare" to refer to alleles with a frequency between 0.1 and 5%, and "very rare" for alleles below 0.1% frequency. Both classes will obviously contribute to risk variance - but as you say, they will differ in terms of the best method to identify them, and the difficulties of validating them as causal.
re: conventions. Agreed that the definition is still evolving, but it's worth remembering that "common = >5%" was defined due to the sample size limitation of the HapMap - a point the HapMap folks themselves go to some pains to make. Where 1000 Genomes and HapMap3 are taking us is the real distinction: "population" variants (i.e. those present in unrelated individuals of a population) and "familial" variants (aka mutations: either de novo or arising in a very recent ancestor; not found outside the immediate pedigree).
There are some pretty real differences in expectations for the biological effect spectrum of these two classes.
Thanks chris - very useful points. I'm hoping to start a series on the promise and challenges of rare variants early next year, so I'll be getting myself up-to-date on this field over the next couple of weeks. Feel free to provide suggestions regarding essential reading material!