Common copy number variation doesn't explain much complex disease risk - but why not?

Wellcome Trust Case Control Consortium. (2010). Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464 (7289), 713-720. DOI: 10.1038/nature08979

The Wellcome Trust Case Control Consortium has just published the results of a massive survey of common, large DNA duplications and deletions (collectively termed copy number variation, or CNVs) in 16,000 patients suffering from complex diseases and 3,000 controls. The results come as no surprise, but are nonetheless disappointing: the study identified absolutely no novel CNVs associated with complex disease. Although three such variants were found to alter disease susceptibility, all three had already been identified in previous studies.

The study's findings suggest that - despite their size - common CNVs play very little role in the etiology of common, complex diseases like rheumatoid arthritis and type 2 diabetes, and researchers will have to look elsewhere to uncover the notorious "missing heritability" for these diseases.
The result shouldn't come as a shock: the major conclusions of this study were presaged by a paper published in Nature late last year (and led by colleagues at the Sanger Institute). This earlier study, using far fewer samples, showed that most common CNVs are well "tagged" by nearby single nucleotide polymorphisms (SNPs), suggesting that any substantial effects of common CNVs on disease risk would likely have been picked up by earlier SNP-based genome-wide association studies. It also showed that, despite this tagging, only a handful of common CNVs could be found that were correlated with known disease-associated SNPs.
The authors of the earlier paper thus concluded:

[...] CNV could only explain a small minority of the disease risk already accounted for existing GWAS studies, let alone the larger (for most diseases) bulk of 'missing' heritability that remains unaccounted for [...]

The new results from the WTCCC should thus really be viewed as extremely large-scale validation of a known negative result, confirming an already-suspected limited role for common CNVs in complex disease susceptibility.

So, no real surprises here. But still, an obvious question remains: why don't common CNVs play a major role in complex disease susceptibility?
This result certainly seems counter-intuitive: many CNVs are large, sometimes deleting or duplicating many thousands of bases of DNA, and one might thus expect a priori that such variants will be much more likely to have a functional effect on the genome than single-base SNPs. Yet based on their results, the WTCCC conclude:

Having completed these analyses the hypothesis that, a priori, an arbitrary common CNV is much more likely than an arbitrary common SNP to affect disease susceptibility is not supported by our data.

How can it possibly be the case that deletions or duplications of thousands of bases have the same probability of having a functional impact as a variant affecting a single nucleotide?

That question is not explicitly addressed in the paper. However, the key term in the quoted sentence above is "common": for any variant to reach the population frequency required to be detectable in this study (around 5%), it has to have run the gauntlet of purifying natural selection. Genetic variants - either SNPs or CNVs - with sizeable effects on disease risk will (in most cases) have been prevented by selection from ever reaching this frequency in the population.
That means that although a brand new CNV would be predicted, on average, to have a substantially larger effect on fitness than a brand new SNP, we should expect common CNVs to show roughly the same distribution of effects on disease risk as common SNPs due to the ruthless filtering out of seriously deleterious variants in both classes by selection. 
So conditioning on a variant being common, its predicted effect on disease susceptibility will be small regardless of whether it is a CNV or a SNP. And because there are far, far fewer common CNVs in the population than common SNPs, the total contribution of CNVs to disease risk is substantially less.
So the negative outcome of this very large study was fairly predictable - although that's easy to say in hindsight, of course!
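To put rough numbers on the selection argument, here is a minimal sketch using the textbook diffusion approximation for a new mutation under additive selection. None of this comes from the WTCCC paper: the population size, the selection coefficients and the function name `prob_reach_frequency` are illustrative assumptions, and the model ignores demography, dominance and the difference between effective and census population size.

```python
import math


def prob_reach_frequency(s, n_diploid, target_freq):
    """Chance that one new mutant copy reaches target_freq before being lost.

    Diffusion approximation for additive (genic) selection in a diploid
    Wright-Fisher population. s is the heterozygote selection coefficient
    (negative = deleterious). Assumes effective size equals n_diploid.
    """
    p0 = 1.0 / (2 * n_diploid)   # starting frequency of a single new copy
    a = 4.0 * n_diploid * s      # population-scaled selection coefficient
    if abs(a) < 1e-12:           # neutral limit of the formula below
        return p0 / target_freq
    return math.expm1(-a * p0) / math.expm1(-a * target_freq)


N = 10_000      # hypothetical population size (assumption)
COMMON = 0.05   # the "common" frequency threshold mentioned in the post

neutral = prob_reach_frequency(0.0, N, COMMON)
for s in (0.0, -1e-4, -1e-3, -1e-2):
    p = prob_reach_frequency(s, N, COMMON)
    print(f"s = {s:+.4f}: P(reach 5%) = {p:.3e} ({p / neutral:.3g}x neutral)")
```

Under these assumptions the chance of ever becoming common falls off steeply as a variant gets more deleterious, whatever kind of variant it is. That is the filter both CNVs and SNPs have to pass through before a study like this one can see them.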
Where to next? The field has already moved on with a new focus on rare variants, which (given the selection-based argument above) seem far more likely to yield useful findings. This year will see the launch of several very large studies taking a variety of approaches to dig into the lower end of the frequency spectrum: imputation using existing data-sets; new genome-wide association chips containing larger numbers of rare SNPs; and large-scale sequencing of candidate genes, whole exomes and even entire genomes. Rare variant discovery has already proved successful in the CNV field, and it seems likely that the next round of CNV association studies will prove enormously more fruitful than this study.

I've heard Evan Eichler make the argument that the types of copy number variation most likely to influence phenotypes are not being assayed with current technologies--that there could be a bunch of associations in rapidly evolving, difficult to genotype gene families that vary massively in copy number between individuals. I wasn't sure how much of the genome is in places like this; you have any thoughts?

This is really interesting stuff. Certainly most (all?) CNVs in Eichler's regions wouldn't be well-tagged by SNPs (both because of rampant seg dups, and also higher rates of recurrent mutation), nor genotypable by standard assays, so I guess there could theoretically be a massive amount of risk hidden in there.

I honestly have no idea how much of the genome sits in these regions (especially given a lot of them aren't present in the reference sequence), but it's certainly plausible that it's a lot. Does anyone out there have a decent estimate?

But in any case, the real question is whether there's good reason to think that common CNVs in these regions will tend to have much larger effects on disease susceptibility than those in genotypable regions. I remember Don Conrad talking about how much non-genotypable CNVs would need to be enriched for functionality before they could start to really negate the "common CNVs don't contribute much to complex traits" claim, and it's a lot.

So I wouldn't be shocked if some of the risk was hiding out in these regions, but I find it hard to imagine that they'll save common CNVs as a major source of heritability. Still, I'd be happy to be proved wrong!

Nice post, Daniel.

It would be possible to get reasonable power to detect rare CNVs if they were typically highly penetrant for complex disease traits. If rare CNVs had sufficiently large effects, one might expect the diseases to start looking like Mendelian traits in pedigrees (like BRCA1 risk alleles).

Such examples are fairly rare, suggesting that they don't often have large enough effects to make them easy to find.

Nice post, nice paper. It certainly makes life easier for the SNP scanning companies: the opposite result would have devalued their products - perhaps why 23andme tweeted it straight away! It does make sense though, as you say, that the SNPs are doing their job well of tagging genetic variation.

A couple of a priori reasons you might not expect common anything to be associated with diseases on average:

1) Common variants (that aren't balanced or hitchhiking) are unlikely to be deleterious, otherwise they wouldn't be common (deleterious in an evolutionary sense; Alzheimer's probably doesn't affect contribution to subsequent generations much);
2) Common variants are old, whereas the overwhelming majority of the total length of the coalescent tree is found in the ridiculously bushy tips that happened since agriculture, which is very recent in the grand scheme of things;
3) Relaxation of constraint (and therefore tolerance of deleterious variants) is likely to be observed during the period of rapid population growth which is again, a recent phenomenon and therefore not likely to be common.

Of course, I'm talking about identical by descent alleles. If we start talking about identical by state where state is taken to be some sort of general incapacitation (like all independent kinds of frameshift or all independent duplications influencing a locus) then the story might be different. Have people looked at it in that sense? Or am I missing something?

I can't believe you all fell for such an obvious april fools joke!

In fact, virtually all susceptibility to common diseases is down to common copy number variants.

Hi John,

Well, this study wasn't at all powered to pick up very rare variants regardless of their penetrance - these variants wouldn't have shown up in the initial discovery panel (based on only 20 CEU and 20 YRI) and thus weren't included on the chip.

I'm not sure about your statement that highly-penetrant low-frequency CNVs are rare (except in the tautological sense, of course!): there are plenty of known Mendelian disease-causing large deletions, and rare large-effect CNVs seem to be contributing non-trivially to autism, schizophrenia and mental retardation.

But low-frequency variants with moderate effect sizes (say, 0.1-1% frequency with odds ratios of 2-5) simply haven't yet been well assayed for disease risk by any approach: they're too rare to be picked up by GWAS, but not penetrant enough to show up in linkage. I'm still hopeful that digging into this section of the frequency-effect size spectrum will yield some appreciable fraction of disease risk heritability.
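As a back-of-the-envelope illustration of the GWAS half of that argument, here is a sketch of a normal-approximation power calculation for a simple allelic test. The 2,000 cases and 3,000 controls match the per-disease WTCCC design; the 5e-8 significance threshold, the function names and the example frequencies are my own assumptions. It also assumes the variant is typed directly, which is optimistic for rare variants that are poorly tagged by chip SNPs.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1.0 + control_freq * (odds_ratio - 1.0))


def allelic_test_power(control_freq, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided two-proportion z-test on allele counts."""
    p0 = control_freq
    p1 = case_allele_freq(control_freq, odds_ratio)
    # Each diploid individual contributes two alleles to the comparison.
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


# Illustrative scenarios (assumed values): (control allele frequency, odds ratio)
for freq, odds in [(0.005, 2.0), (0.01, 5.0), (0.30, 1.3)]:
    power = allelic_test_power(freq, odds, n_cases=2000, n_controls=3000)
    print(f"freq = {freq:.3f}, OR = {odds:.1f}: power ~ {power:.3f}")
```

Treat the output as a rough guide only. The point is how quickly power collapses toward the rare, modest-effect corner of the range, even before accounting for imperfect tagging.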

An example for p-ter @1:…

From the intro to the latter:

The genes encoding the killer immunoglobulin-like receptors (KIR) are situated within a segment of DNA that has undergone expansion and contraction over time due in large part to unequal crossing over. Consequently, individuals exhibit considerable haplotypic variation in terms of gene content. The highly polymorphic human leukocyte antigen (HLA) class I loci encode ligands for the KIR; thus, it is not surprising that KIR genes also show significant allelic polymorphism. As a result of the receptor-ligand relationship between KIR and HLA, functionally relevant KIR-HLA combinations need to be considered in the analysis of these genes as they relate to disease outcomes.

So these are hot candidates for most immune-mediated diseases, and almost impossible to assay in bulk currently.

So far, we don't have good approaches to tag all CNV copies, and the resolution of copy number determination with array-based quantitation remains very low. One may be able to distinguish 1 copy from 2, but how about 2:4 and 10:20? Persons 1 and 2 may both have 10 copies, but as haplotypes they may carry 4+6 and 5+5. With that said, disease association studies are kinda premature. I'd like to hear your thoughts on these issues.
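The two problems raised here can be made concrete with a small sketch. The function names are hypothetical, and the second function assumes array signal is exactly proportional to copy number, which real arrays only approximate.

```python
from math import log2


def diplotypes(total_copies):
    """All unordered (haplotype_a, haplotype_b) splits of a diploid copy number."""
    return [(a, total_copies - a) for a in range(total_copies // 2 + 1)]


def log2_step(copies):
    """Shift in log2 signal ratio when going from `copies` to `copies + 1`.

    Assumes signal is strictly proportional to copy number (an idealization).
    """
    return log2((copies + 1) / copies)


# A total of 10 copies is compatible with several haplotype pairs,
# including the 4+6 and 5+5 configurations mentioned above.
print(diplotypes(10))

# The signal gap between neighbouring copy numbers shrinks as copy number grows.
for n in (1, 2, 10):
    print(f"{n} -> {n + 1} copies: log2 ratio shift = {log2_step(n):.3f}")
```

So even a perfect measurement of total copy number leaves the underlying haplotypes ambiguous, and the measurement itself gets harder as copy number rises.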