Genome-wide association studies: failure or success?

The latest issue of the New England Journal of Medicine has four excellent and thought-provoking articles on the recent revolution in the genetics of common disease and its implications for personalised medicine and personal genomics. Razib and Misha Angrist have already commented, and there's also a thorough lay summary by Nick Wade in the NY Times.

The scene is set by a brief but useful review of progress in genome-wide studies of human disease, which is worth reading if you need to get yourself up to speed on the scope of progress in modern disease genomics. The main course, however, is provided by three opinion pieces tackling the recent results of genome-wide association studies from quite different angles. I'll discuss two of these articles in this post, and will hopefully have time to tackle the third article - and the implications of this debate for the future of personal genomics - in a second post.

Before I get to the opinion pieces, however, here's a whirlwind tour of the history of research into the genetic basis of common diseases (such as type 2 diabetes): prior to 2005, the field was largely a scientific wasteland scattered with the embarrassing and wretched corpses of unreplicated genetic association studies, with barely a handful of well-validated genetic risk factors peeking above the noise; in 2005, the first genome-wide association studies (GWAS) emerged from the combination of the hugely successful HapMap project with new technology for testing hundreds of thousands of single-base genetic variants simultaneously; from 2005 until today, GWAS have rapidly grown in scale and complexity, with studies now looking at over a million genetic markers in cohorts approaching a hundred thousand individuals.

From the outset the aim of genome-wide association studies has been two-fold: (1) identifying markers that can be used to predict individual disease risk; and (2) highlighting the molecular pathways underlying disease, providing potential targets for therapy.

There's little disagreement in the scientific community that the appearance of GWAS has changed the face of common disease genetics: from that handful of genuine associations in 2005, we now have somewhere in the vicinity of 400 regions of the genome displaying replicated associations with around 70 common diseases or complex traits. Where the experts differ, however, is on the issue of whether continuing to increase the scale of GWAS to ever-larger sample sizes is worth the substantial costs, and whether personal genomics companies like 23andMe - who use GWAS results to provide personalised genetic disease risk predictions - are providing a valuable service or a scam.

The first two NEJM opinion pieces represent two poles of the diversity of views within the genetics community on these issues, and illustrate both the important lessons that have emerged from GWAS and the many questions that are yet to be answered.

Have GWAS actually been useful?

The first article off the rank is from notable genome-wide association study contrarian David Goldstein. Goldstein's voice dominates Nick Wade's NY Times article, and not for the first time - Goldstein also featured in a Wade article in September last year, which I quoted approvingly at the time.

Goldstein's argument is straightforward: although GWAS have been "strikingly successful" in identifying sites of common genetic variation associated with complex diseases, the variants they have found - both individually and altogether - explain a small fraction of the overall genetic contribution to common disease risk. These common variants typically only increase risk by 10 to 70% - so for a disease that affects 2% of the population, carrying one of these variants means your disease risk increases to 2.2 to 3.4%. For even heavily studied diseases the variants found to date typically explain less than 20% of the heritable variance in disease risk.
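To make the arithmetic in that paragraph concrete, here is a minimal sketch in Python. It assumes the quoted "10 to 70%" figures are relative increases applied directly to the 2% baseline risk; the function name `absolute_risk` is purely illustrative.

```python
def absolute_risk(baseline_risk, relative_increase):
    """Carrier risk, treating the quoted percentage as a relative increase
    applied to the population baseline (an assumption of this sketch)."""
    return baseline_risk * (1.0 + relative_increase)


baseline = 0.02  # a disease affecting 2% of the population

for increase in (0.10, 0.70):  # the 10% and 70% ends of the typical range
    carrier = absolute_risk(baseline, increase)
    print(f"+{increase:.0%} relative risk: {baseline:.1%} -> {carrier:.1%}")
```

Even at the top of the range, a carrier's absolute risk moves by less than one and a half percentage points, which is why these variants add so little to individual prediction.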

This point is non-controversial, and has been apparent for quite a while - over a year ago, for instance, I wrote a post listing the places where the missing heritable risk could be hiding. The major response from researchers performing GWAS has been to continually increase sample sizes, giving them power to reveal variants with ever-smaller effect sizes.

However, Goldstein argues that this approach is doomed to failure. Based on what we know about the distribution of effect sizes of risk variants, he argues that if common risk variants underlie the totality of genetic risk there must be a ridiculously large number of them; and that means that these variants will provide little useful insight into the biology of disease:

If common variants are responsible for most genetic components of type 2 diabetes, height, and similar traits, then genetics will provide relatively little guidance about the biology of these conditions, because most genes are "height genes" or "type 2 diabetes genes."

Goldstein's negative tone is balanced by a buoyant review from Joel Hirschhorn celebrating the successes of the GWAS era. Hirschhorn argues that despite the failure to uncover the majority of the genetic disease risk, GWAS have in fact contributed substantially to our understanding of disease mechanisms. Here he has two striking examples to back him up: the revelations of the involvement of the complement pathway in age-related macular degeneration (AMD), and of the autophagy and IL23 pathways in Crohn's disease. These pathways weren't known to play a role in these diseases prior to GWAS, but the evidence for their involvement from GWAS (particularly in the case of AMD) is unambiguous.

For what it's worth, I think Hirschhorn's examples demonstrate that Goldstein is overstating his point here; clearly common variants can be highly informative about biology, and it seems likely that we will find plenty more such examples as we dig into the hundreds of genes uncovered by GWAS. Hirschhorn is prepared to bet on it:

"[By] the 2012 [American Society of Human Genetics] meeting, genomewide association studies will have yielded important new biologic insights for at least four common diseases or polygenic traits -- and efforts to develop new and improved treatments and preventive measures on the basis of these insights will be well under way."

Of course the question of whether such revelations will be common and powerful enough to justify the fiendishly high costs of ever-larger GWAS remains an open one.

Where I think Goldstein does have the upper hand is in his critique of the success of the second goal of GWAS: identifying markers that can be used to predict disease risk. At the current time, the common variants identified by GWAS contribute very little of value to individual disease risk predictions over existing clinical markers for most common diseases. Hirschhorn's response to this argument is rather muted and speculative; in fact, Goldstein himself provides the best counter-examples to this trend (GWAS for some drug response and infectious disease susceptibility traits, where common large-effect variants have been uncovered). But these counter-examples aside, Goldstein is perfectly correct that common variants have proved disappointing from a clinical predictive standpoint.

Some have interpreted this as meaning that personal genomics is dead; this is not the case. It simply means that health predictions from the current incarnation of personal genomics (with its single-minded focus on common variants) should not be relied on too heavily by consumers. Over the next few years, personal genomics will move with the science towards increasingly better predictions.

Moving beyond common variants 

Somewhat surprisingly, one of the major themes of Goldstein's review goes uncontested by Hirschhorn: the notion that it's highly unlikely that more common variants explain the majority of the remaining genetic risk. Instead, Goldstein bets (and I completely agree) on a substantial role for rarer variants with substantially larger effect sizes. I'm planning to expand on the theoretical argument for the importance of rare variants in an upcoming series of posts, but for now I'll simply repeat Goldstein's summary: "the efficiency of natural selection in prohibiting increases in disease-associated variants in the population" means that the variants with the largest individual effects on disease will tend to be rare.

I think it's important to note here that while Goldstein has been one of the most public voices noting the disappointing yield from GWAS, he is by no means a lone voice in the wilderness. Most of the researchers working on GWAS that I've spoken to are not increasing their sample sizes because they think common variants are the only source of disease risk; they're doing it because the technology for surveying rare variants is only just becoming feasible, while GWAS technology is extremely well-established and reliable, and there are still plenty of common variants out there to be discovered.

Nonetheless, as sequencing technology becomes cheaper you can expect to see an explosion of targeted gene sequencing studies looking for rarer risk variants (and finding them soon, I hope, since this was my top prediction for 2009!). At the same time, a second generation of GWAS will be performed using new chips targeting variants throughout the genome at ever-lower frequencies. The two strategies will finally converge on the holy grail of genetic analysis: complete genome sequencing of hundreds to thousands of disease patients and controls.

These studies will hopefully yield a fine harvest of rare disease-associated variants with much stronger effects on risk than the common variants uncovered to date - increases in risk of between two- and ten-fold. Such variants will provide fine-grained insights into disease pathways, but more importantly they will be much more useful for individual risk prediction - if an individual is unlucky enough to be carrying just one or two of them they will instantly have a substantially higher risk than non-carriers. The emergence of these variants will make personal genomics vastly more useful for health predictions.

That's enough for now - hopefully I'll have time over the weekend to discuss the third NEJM article, and expand on its implications for personal genomics. 



The money in disease prediction is in things like Framingham calculators embedded in a PHR via Web 2.0, not in b.s. systems which predict next to nothing.


Both Goldstein and Joel Hirschhorn are correct. Classical quantitative genetic analysis suggests that there may be three classes of traits: a) simply inherited ones, b) those governed by oligogenes (few genes with large effects) and c) polygenic ones (many genes with small effects). The difference between b and c could be blurry. I think Goldstein is largely correct for traits that are governed by a polygenic mode of inheritance, for example height, weight, longevity etc. Joel Hirschhorn is correct if these are under an oligogenic mode of inheritance. Although Newton Morton published a very useful paper on these three modes, most people do not reference it. Prior to Morton there were pioneers such as Kenneth Mather who discussed these issues in detail. The present discoveries will give us a vague idea about the genetic architecture of traits, as they will not give ALL the genes governing a given trait. But will they give us any predictive value? The answer is a qualified yes. Have we learned anything from knowing the effective factors (genes) underlying a trait? Yes and no. On the other hand, phenotypic, biochemical and developmental traits in relation to an individual's context in a given family would probably provide better indices toward predicting the development of complex traits, including diseases.

By Diddahally Gov… (not verified) on 17 Apr 2009 #permalink

Sure - integrating other predictive variables with genetic information is where the real interest lies. How long do you think it will be before 23andMe do this? Will mainstream medicine do it any better than they will?

What is it, something like $7000 per person per year is spent on medicine in the US?

Over 70 years that adds up to ~$500,000; and that number is increasing.

Would a $5,000 genome be justified if it could reduce medical costs by 1% over a lifetime?

How about a $1,000 genome reducing medical costs by 0.5% in a lifetime?

How about a $100 SNP chip that reduced medical costs by .05% in a lifetime?

So no, the substitution isn't genetics substituting family history (duh);
it's genetics and family history substituting 1 major procedure or hospital stay in a lifetime (at a minimum).

It's about genetics substituting leisure and luxury as people try to maximize their lifespans and the quality of those extra years.

The debate isn't between genes and family history; they are synergistic.

Can we have the real debate please?

By Paul Jones (not verified) on 19 Apr 2009 #permalink

There is a problem in the "culture" here. Answers to some of these questions are found in classical evolutionary biology and plant and animal breeding (which may be defined as controlled or directed evolution). Neither 23andMe (albeit I know a few of them personally) nor medical scientists are the least bit bothered to incorporate these insights into their work. I think it's here that evolutionary principles and insights would become eminently useful in medicine as well as in understanding these intricacies in managing human health. The very first chapter in Lewontin's (1974) book "The Genetic Basis of Evolutionary Change" is still a beacon light for approaching some of these questions.

By Diddahally Gov… (not verified) on 19 Apr 2009 #permalink


Great comments. I'm actually struggling to bring myself up to speed on quantitative genetics at the moment, and will hopefully have more to say about it in a couple of months. In the meantime, you really should start your own blog to talk about this stuff!


No disagreement from me.

Daniel and Paul:

Thank you for encouraging me to start my own blog. But there is already one by Dr. Randolph Nesse of the University of Michigan, which could eminently serve this purpose.

Harboring some of the discontents concerning the basic tenets of GWAS (starting with Andrew Clark, Kenneth Weiss, Jonathan Pritchard etc.), and other questions that surround medicine and human health, some of us started lively discussions on this and related topics under the general title "Evolution in Health and Medicine" in 2007. These discussions have culminated in a very recent and successful Sackler Colloquium on "Evolution in Health and Medicine." The colloquium was sponsored by both the National Academy of Sciences and the Institute of Medicine. A news item on the colloquium has appeared in Science (324:162-163), titled "Evolutionary Medicine: Darwin Applies to Medical School." We strongly believe that evolutionary biology could greatly benefit from the wealth of data that are emerging from the most intensively investigated model organism - humans. Conversely, medicine could benefit from the diverse and time-tested insights that evolutionary biology brings to improving and managing human health.

By Diddahally R. … (not verified) on 19 Apr 2009 #permalink

This would be the smartest way to go. I know Coriell is working to add these to their risk prediction tools, which is where everyone should be going. I know a couple of companies like this. Check out, turns out the Malaysians are way ahead of Wojcicki



I enjoy your blog thoroughly. Diddahally's previous post is essentially right on the mark. Another aspect of the effectiveness of GWAS, which seems to be ignored, is the effect of the environment. Decades of quantitative genetics research in plants and animals demonstrates that the phenotypic effect of a given genetic variant is dependent upon the particular environment in which the phenotype is studied. This holds true even for some major genes.

I think it's safe to assume that most complex diseases involve many genes of small effect, the effect of the environment is quite high, and there are other genetic forces at work which will confound researchers (i.e., epistasis). Since carefully constructed genetic populations are impossible to make with humans, it's highly unlikely that either of the two schemes (GWAS or whole-genome studies) will be able to uncover most of the genetic architecture of complex disease anytime soon. To expect more than this is quite naive - there's simply too much human genetic variation and too many different environments. Perhaps rather than attempt to discover all genes involved in a particular disease across all humans, a population-specific approach may be more fruitful.

By Matt Kinkade (not verified) on 22 Apr 2009 #permalink

Mr. Kinkade is absolutely right. Once again, the great quantitative geneticist, the late Falconer, wrote this great paper nearly sixty years ago:

Falconer, DS. 1952. The problem of environment and selection. The American Naturalist 86:293-298.

I am sure that Mr. Kinkade had the above paper in mind.

Decades later, in the late 70's, James Neel also recognized these questions in a different way - he called population-specific alleles "private alleles/polymorphisms." I hope some of these ideas will get their due share in the contemporary human genetics literature.

By Diddahally Gov… (not verified) on 23 Apr 2009 #permalink

I think you are probably right on the issue with the rare variants.
We will probably have to wait until the completion of the 1000 Genomes Project to include variants with a frequency of 1% or more in the screening. The current tests rely on variants at 10% or more (HapMap), I think, so they are bound to find only the common variants. After the 1000 Genomes Project is completed, wait 2-3 years and we will have a further wave of discovery of new disease susceptibility variants, possibly with a higher hazard ratio than the ones we already know.

As far as "high throughput sequencing to find susceptibility genes" is concerned, there already is such a study:
Science 10 April 2009:
Vol. 324. no. 5924, p. 217
DOI: 10.1126/science.1171202
Exomic Sequencing Identifies PALB2 as a Pancreatic Cancer Susceptibility Gene

Sorry about that, I probably first read about it on your blog and then forgot. I did read the paper as I work on cancer genomics myself and found it quite interesting.

The reason why we are unable to define life is that our concept of particulate gene is wrong. If we treat biological program based on the non-particulate gene concept originally proposed by Wilhelm Johannsen in 1911, we will be able to define and explain life. In accordance with the intangible bioprogram, a computer model of the organism has been proposed (see the book The Computer Universe published by Adam Publishers, New Delhi). An organism is natural biocomputer with hardware (all the chemical structures including genome in the cell, tissues and organs) and biosoftware (intangible biological program). The bioprogram is not constituted by genome but it is non-particulate like our computer programs. But it needs a physical medium for storage as in our computer and robots. The biosoftware is stored on the chromosomes, which is the memory storage (hard disk), perhaps by a similar mechanism as information is stored in brain memory.

The cells of a dead body have all their structures including genome intact; yet they do not exhibit life! This fact about the particulate genome is opposed to the fundamental principle of chemistry. How can a structure lose its properties; in this case, information encoded by its structure?

In the computer model of the organism, life is defined as the manifestations of the execution of the intangible bioprogram (stored on the chromosome), and death as the deletion of the bioprogram from the cells. A dead body is like a computer without software. The soul mentioned in Scriptures is nothing but the non-physical biological program stored on the chromosomes. The forms of artificial life are our computers, software-based toys, robots, etc.

"It's the linkage, stupid."

The problem: Human Genetics is practiced by statisticians, not Geneticists. Statistical association of alleles and phenotype will never provide clarity.

Once genome sequencing becomes sufficiently inexpensive forget about LD. Instead, map meiotic crossovers in families and measure genetic linkage.

By C. Kloess (not verified) on 17 Jun 2009 #permalink

How would you propose that we do family linkage studies for variants conferring per-allele odds ratios of 1.1? I was under the impression that linkage was essentially impossible for anything with an OR less than around 5.

Do you want to detect marginal contributors to a phenotype or the major loci? What if one (or a few genes) makes a major contribution but is not a common allele? At some frequency they arise, but are prone to loss because of a significant but minor detrimental phenotype. GWA would be blind to these or underestimate the contribution.

By C. Kloess (not verified) on 24 Jun 2009 #permalink

I agree with everything you just said, but you didn't answer my initial question: how can you do a linkage study for a variant with an odds ratio of around 5 (which is what we might reasonably expect for most of these rare moderate-effect variants)? In particular, how do we do it for disease where the background rate of disease is as high as 20% (e.g. type 2 diabetes)? In these cases many family members could suffer from the disease without carrying the variant.

Looking for rare variants is clearly important, but I don't think the answer is linkage analysis: as I said in the post, the best way forward seems to be by moving towards large-scale (and eventually whole-genome) sequencing studies in very large cohorts of unrelated individuals.
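To put rough numbers on the odds-ratio question above, here is a small Python sketch. It rests on simplifying assumptions that are mine, not part of the discussion: the 20% background rate is treated as the risk in non-carriers, half the relatives in a segregating family are assumed to carry the variant, and all other shared familial risk is ignored. The function names are purely illustrative.

```python
def carrier_risk(noncarrier_risk, odds_ratio):
    """Convert a non-carrier risk and an odds ratio into a carrier risk."""
    noncarrier_odds = noncarrier_risk / (1.0 - noncarrier_risk)
    carrier_odds = noncarrier_odds * odds_ratio
    return carrier_odds / (1.0 + carrier_odds)


def noncarrier_share_of_affected(noncarrier_risk, odds_ratio, carrier_fraction):
    """Fraction of affected relatives who do NOT carry the variant,
    given the (assumed) fraction of relatives who are carriers."""
    affected_carriers = carrier_fraction * carrier_risk(noncarrier_risk, odds_ratio)
    affected_noncarriers = (1.0 - carrier_fraction) * noncarrier_risk
    return affected_noncarriers / (affected_carriers + affected_noncarriers)


background = 0.20  # e.g. type 2 diabetes, as in the comment above
odds_ratio = 5.0   # a rare, moderate-effect variant

print(f"carrier risk: {carrier_risk(background, odds_ratio):.1%}")
share = noncarrier_share_of_affected(background, odds_ratio, carrier_fraction=0.5)
print(f"affected relatives without the variant: {share:.1%}")
```

Under these assumptions a carrier's risk rises from 20% to roughly 56%, yet about a quarter of affected relatives would not carry the variant at all, which is the kind of noise that blurs a linkage signal.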