Nature Genetics has just released six advance online manuscripts on the genetic architecture of complex metabolic traits. The amount of data in the manuscripts is overwhelming, so this post is really just a first impression; I suspect I'll have more to say once I've had time to dig into the juicy marrow of the supplementary data.
The general approach of exploring the genetic architecture of quantitative disease-associated traits (often called intermediate phenotypes or endophenotypes) rather than categorical case-control analyses of disease status raises some interesting questions, but I'm going to save those for another time and focus instead on the results.
For those who don't want to wade through the details, here's the gist: these papers use genome-wide association data from very large numbers of individuals to analyse the genetic architecture of disease-associated traits like blood lipid and glucose levels (rather than looking for associations with disease status itself). They find a bewildering array of new variants associated with these traits, and indications of many more genes still to come as sample sizes increase. However, for most traits, despite very large sample sizes the majority of the variance is not explained by the identified variants.
For those who are interested in more details, I've tucked them below the fold:
The genetics of blood glucose levels
Two of the manuscripts (here and here) describe genome-wide association studies for fasting plasma glucose levels, and both identify a significant association with variants close to a gene (MTNR1B) encoding a melatonin receptor in addition to validating previously identified associations with variants in other genes. A third paper explores possible mechanisms for this association, showing that the risk variant increases the expression of the melatonin receptor in the pancreatic cells responsible for secreting insulin. All three of the studies demonstrate that the version of this variant associated with higher blood glucose is also associated with type 2 diabetes (unlike previous glucose-altering variants), although the effect is tiny - just a 10% relative increase in risk.
The combination of trait and disease associations with functional data shows the way forward in human genomics - this is science that is both elegant and large-scale. However, there's also the bad news: one of the studies examined fasting plasma glucose in a whopping 36,610 individuals, and yet the variants discovered explain just 1.5% of the variance in glucose levels. Clearly there is much more work to be done to dissect out the genetic architecture of this trait.
The genetics of blood lipids, and other metabolic traits
The second batch of three papers looks at the genetic architecture of yet another set of important metabolic traits: blood lipid levels. Two massive (as in, discovery panels of ~20,000 samples) genome-wide association studies report a plethora of genes associated with blood concentrations of total cholesterol, high- and low-density lipoprotein and triglycerides.
It will take me a while to mine the data, but I count 18 new regions identified by the two studies, with only one new region overlapping between the two. The weak overlap is likely due to the fact that these studies are now reaching into the realm of common variants with exceptionally small effect sizes, meaning that any individual variant has only a small chance of being picked up in the discovery phase of even these huge studies - and that in turn suggests that building up even larger cohorts will yield further common, small-effect risk variants.
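To make the weak-overlap argument concrete, here is a rough back-of-the-envelope sketch of how discovery power depends on effect size. It is not taken from either paper: the additive linear model, the large-sample normal approximation, the genome-wide threshold of 5e-8 and the example variance fractions are all assumptions chosen for illustration, and `gwas_power` is a hypothetical helper name.

```python
from statistics import NormalDist


def gwas_power(n, variance_explained, alpha=5e-8):
    """Approximate power of a two-sided single-variant test on a quantitative trait.

    Assumes an additive linear model and a large-sample normal approximation;
    alpha defaults to the conventional genome-wide threshold (an assumption here).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected z-score for a variant explaining this fraction of trait variance.
    ncp = (n * variance_explained / (1 - variance_explained)) ** 0.5
    return std_normal.cdf(ncp - z_crit) + std_normal.cdf(-ncp - z_crit)


n = 20_000  # roughly the discovery panel size mentioned above
for q2 in (0.0005, 0.001, 0.002):  # hypothetical fractions of variance explained
    power = gwas_power(n, q2)
    # If two independent studies of this size each have the same power,
    # the chance that both pick up the same variant is power squared.
    print(f"variance explained {q2:.2%}: power {power:.2f}, both studies {power ** 2:.2f}")
```

Under these assumptions, power falls away sharply as the variance explained by a single variant shrinks, and the chance that two independent studies both detect the same variant falls faster still. That is consistent with two large scans turning up mostly non-overlapping regions.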
The final study is not huge by modern GWAS standards - with a meagre 4,763 participants - but in my mind it is the most interesting. The study looked at virtually everyone born in two provinces of northern Finland in 1966; in addition to looking at genome-wide patterns of common genetic variation, the researchers also assessed environmental influences (like smoking and oral contraceptive use) and an array of nine heritable quantitative traits with strong disease relevance: body mass index, blood levels of cholesterol, lipids, glucose, insulin and the inflammation marker C-reactive protein, and blood pressure. In addition they carefully controlled their study for the effects of geographic ancestry - see my previous two posts for some details (and a cool map) of the genetic structure of this cohort.
This approach allowed the team to examine the genetic landscape of these nine disease-relevant traits in a small but extremely well-controlled population. Here's what they found:
The figure shows a "Manhattan plot" for each of the nine traits, with each dot indicating a different variant examined by genome-wide association (with chromosomes coloured in alternating black and grey), the y-axis indicating a measure of the probability that a variant is associated with that trait, the horizontal red line indicating the threshold for formal statistical significance, and the vertical blue lines showing regions with significant associations in previous studies.
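For readers unfamiliar with these plots: the y-axis of a Manhattan plot is conventionally the negative base-10 logarithm of the association p-value, so taller dots mean stronger evidence. The sketch below assumes that convention applies here and uses the commonly quoted genome-wide threshold of 5e-8; the variant names and p-values are made up for illustration and do not come from the paper.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # assumed conventional threshold, not taken from the paper


def manhattan_height(p_value):
    """Height of a dot on a Manhattan plot: -log10 of the association p-value."""
    return -math.log10(p_value)


# Hypothetical variants and p-values, for illustration only.
variants = {"snp_a": 0.03, "snp_b": 2e-6, "snp_c": 4e-11}

threshold = manhattan_height(GENOME_WIDE_ALPHA)  # position of the horizontal line
significant = [name for name, p in variants.items() if manhattan_height(p) > threshold]

for name, p in variants.items():
    print(f"{name}: height {manhattan_height(p):.2f}")
print(f"threshold {threshold:.2f}, significant: {significant}")
```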
You can see immediately that although there is substantial overlap between the associations in this study and previously reported signals (blue vertical lines), this overlap is far from complete, and certainly less extensive than I would have expected. This study both identifies nine new associations with no previous formal statistical support, and fails to replicate several signals with strong associations in published studies. This heterogeneity likely reflects a combination of the unusual study design, the relatively small sample size and the use of the Finnish population, which will have both a slightly different set of risk variants and a different correlation structure around shared risk variants compared to the broader European population.
The other interesting message from this graph is that genetic architecture varies substantially between traits. This is not a new discovery, but it's great to be able to see these nine traits lined up against one another in a single cohort to emphasise the differences. You can see immediately from the Manhattan plots that some traits (lipid, CRP and glucose levels) have multiple association signals towering over the Magical Red Line of Significance, whereas others (insulin, BMI and blood pressure) show almost nothing - bear in mind that the y-axis changes between traits.
One of the nicest aspects of this study is that there's no shying away from reporting on the small fraction of the overall variance explained by the discovered markers. In fact, the authors provide a handy graph to quantitate the degree to which their identified genes can explain variation in seven of their traits (the two blood pressure traits are not shown, since the study didn't identify any significant variants associated with these traits):
The length of the grey bars indicates how much variance can be explained by a model combining genetic data with other variables like smoking, BMI, and geography (see the table at the end of the post for a list of the variables used for each trait). The red bar indicates the fraction of the variance explained by the genetic predictors. As you can see, the genes discovered so far typically explain less than 6% of the variance for any of these traits, with some traits (e.g. BMI, insulin) performing substantially worse, and of course blood pressure failing to yield even a single significant variant. This poor performance is despite strikingly high heritability for many of the analysed traits - typically above 50%, with estimates ranging as high as 90% for traits like serum LDL.
The disappointing predictive capacity of the common risk variants identified by GWAS is becoming almost too mundane to mention, and the authors adopt an upbeat tone here. Rather than mumble the usual excuses for the missing variance, the authors instead note that the population used in this study may be ideal for future discoveries of one of the potential sources of the missing variation: rare variants. The common variants targeted by current GWAS technologies are likely to be quite ancient, and are thus probably found in most European populations. In contrast, rare variants will typically be young and relatively geographically restricted - so finding them will be substantially easier in the sort of homogeneous, relatively inbred population analysed in this study.
The authors even have proof of principle, as one of the novel variants that fell out from this study was a rare marker close to the androgen receptor (AR) gene associated with LDL levels - the risk version of this marker has a frequency of just 1.7% (well below the magical 5% threshold that the current generation of chips was designed to capture). The only reason it was picked up by the study was that it had a comparatively strong effect, substantially larger than that seen for the other common LDL-altering variants identified in this population. Unsurprisingly, this variant has not been identified in previous scans for LDL genes that usually pooled individuals from various European populations together.
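Why does a rare variant need a strong effect to be detected? A minimal sketch, assuming a simple additive model at an autosomal variant in Hardy-Weinberg equilibrium: the fraction of trait variance explained is about 2p(1-p)β², where p is the allele frequency and β is the per-allele effect in standard deviations of the trait. The function names, the 30% comparison frequency and the 0.1 SD effect are hypothetical, and because the AR marker is X-linked the exact arithmetic differs somewhat by sex, so treat this purely as an illustration of the scaling.

```python
def variance_explained(freq, beta):
    """Fraction of trait variance explained by one additive variant.

    freq is the allele frequency; beta is the per-allele effect in trait
    standard deviations. Assumes an autosomal variant in Hardy-Weinberg
    equilibrium.
    """
    return 2 * freq * (1 - freq) * beta ** 2


def effect_needed(freq, target_variance):
    """Per-allele effect (in SD units) needed to explain target_variance."""
    return (target_variance / (2 * freq * (1 - freq))) ** 0.5


common_freq, common_beta = 0.30, 0.10  # hypothetical common variant
rare_freq = 0.017                      # frequency quoted above for the rare marker

target = variance_explained(common_freq, common_beta)
rare_beta = effect_needed(rare_freq, target)

print(f"common variant explains {target:.4f} of the variance")
print(f"rare variant needs a per-allele effect {rare_beta / common_beta:.1f}x larger to match")
```

Since detection power tracks the variance explained rather than the raw effect size, a variant at 1.7% frequency only shows up in a cohort of this size if its per-allele effect is several times larger than that of a typical common variant.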
This rare variant provides a small taste of experiments to come: no doubt we will see this particular cohort, and others like it, mined extensively over the next few years with large-scale sequencing and other emerging tools in an effort to identify and characterise the sources of the missing variance.
P.S. As promised, the table of variables used to predict the seven traits in the histogram above:
The variable "Sex OC PG" incorporates the combined effects of sex, oral contraceptive use and pregnancy status; C1 and C2 are the first two principal components of the genetic clustering analysis, used as a proxy for geographical ancestry.
Chiara Sabatti, Susan K Service, Anna-Liisa Hartikainen, Anneli Pouta, Samuli Ripatti, Jae Brodsky, Chris G Jones, Noah A Zaitlen, Teppo Varilo, Marika Kaakinen, Ulla Sovio, Aimo Ruokonen, Jaana Laitinen, Eveliina Jakkula, Lachlan Coin, Clive Hoggart, Andrew Collins, Hannu Turunen, Stacey Gabriel, Paul Elliot, Mark I McCarthy, Mark J Daly, Marjo-Riitta Järvelin, Nelson B Freimer, Leena Peltonen (2008). Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics. DOI: 10.1038/ng.271