Pickrell, J., Coop, G., Novembre, J., Kudaravalli, S., Li, J., Absher, D., Srinivasan, B., Barsh, G., Myers, R., Feldman, M., & Pritchard, J. (2009). Signals of recent positive selection in a worldwide sample of human populations. Genome Research. DOI: 10.1101/gr.087577.108
I pointed yesterday to a new paper in Genome Research taking a genome-wide look at the signatures of recent natural selection in a worldwide sample of humans.
I promised a more thorough analysis of this paper today, but I see Razib at Gene Expression has already done a fine job of that. Razib's post covers the bulk of the most important findings of this paper in detail, so you should go read it now; I'll really just be expanding on what I see as some of the most interesting nuggets of data.
I also mentioned the paper's rather indirect critique of John Hawks' "recent acceleration" hypothesis, which proposes that humans have experienced very rapid evolutionary change over the last 40,000 years. John Hawks responded to that critique last night, pointing out that the paper does not explicitly test the acceleration hypothesis and that its major findings are in fact consistent with his theory. The paper's lead author, Joe Pickrell, has a quick comment on my post yesterday clarifying his position.
Now, onto what I see as some of the most interesting results from the paper.
Different populations show different signals of selection
This isn't a new finding, but it's much more striking in this study compared to previous analyses due to the massively increased number of populations studied. Basically, this tells us that different human populations have responded to their local environment in different ways - either because their environments were different, or because they had different genetic variants available to fuel the process of adaptation. In other words, not all humans share the same evolutionary history.
This figure from the paper (which I've reformatted slightly) shows the degree of sharing between the top 10 signals of selection from each of the 8 broad population clusters defined in the paper (from top to bottom: Biaka Pygmies, Bantu speakers, Europe, Middle East, South Asia, East Asia, Oceania and the Americas). The colour of the boxes ranges from red (strong evidence for selection) to white (no evidence). There is considerable sharing between Europe, the Middle East and South Asia, but the top hits in each of the other clusters tend to be largely restricted to that cluster:
This pattern is even clearer in some of the expanded Supplementary Figures (see the example right at the end of the post).
Some of the population differences make perfect sense. The fact that the genes underlying skin pigmentation have been under different selective pressures in Africans and Europeans, for instance, is readily apparent from the strikingly different skin colours of individuals from these populations. What scans for selection (and other evidence) suggest is that these local adaptive differences go deeper than skin colour, likely affecting many different aspects of human biology. Of course that would come as no surprise to most mainstream biologists.
Despite the broad-scale differences between continental groups, the authors found little evidence for differences in targets of selection between closely related populations; in other words, populations that live close together and share relatively recent common ancestry tend to have experienced similar selective pressures. However, the team did identify signals of highly local adaptation in a few genes, mostly involved in the immune system - presumably reflecting adaptation to geographically restricted infectious diseases.
Regions associated with type 2 diabetes risk show evidence of positive selection
The study looks at regions associated with a whole range of common diseases and other traits (e.g. height), but doesn't find much of a striking signal for any of them. For type 2 diabetes, however, there is evidence that the regions associated with disease risk are also significantly more differentiated than expected between African and non-African populations - a pattern suggestive of recent adaptive evolution. Several of these regions also show linkage-based signals of selection (see below).
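To give a feel for what "more differentiated" means in numbers, here's a minimal sketch of Wright's F_ST for a single two-allele variant in two populations. This is a generic textbook measure; I'm not claiming it is the exact statistic or estimator the authors used, and the function name and allele frequencies below are invented for illustration.

```python
def fst_two_pops(p1, p2):
    """Wright's F_ST for one biallelic variant in two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    Returns a value from 0 (identical frequencies) to 1 (fixed differences).
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within each population
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        # Allele is fixed or absent everywhere: no variation to partition
        return 0.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical frequencies, not data from the paper
    print(fst_two_pops(0.50, 0.55))  # similar frequencies: value close to 0
    print(fst_two_pops(0.10, 0.90))  # very different frequencies: much larger value
```

A scan then asks whether variants in a set of regions (here, those associated with type 2 diabetes) sit further out in the genome-wide distribution of such values than you'd expect by chance.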
What does this mean? It's hard to say precisely, and the authors avoid speculating too wildly about the implications. Because the precise genetic variants that alter type 2 diabetes risk in these regions are yet to be identified, it is difficult to determine whether selection is acting on these variants or on other, independent variants in the same region. Still, this is a tantalising clue to the evolutionary origins of one of the most common modern diseases, which I'm sure we'll hear more about in the near future.
We don't understand the function of most genes under selection
As is the case for recent genome-wide association studies for common diseases, the majority of the signals emerging from this study localise to regions that contain either no genes, genes of unknown function, or genes with no obvious link to recent human adaptation. Although the functional basis for some of the signals is clear (e.g. pigmentation genes), most of them currently defy explanation.
A good example is the region that emerges as one of the clearest signals of positive selection in non-African populations, which contains one protein-coding gene and three non-protein-coding RNA genes. The protein-coding gene, C21orf34, is just one of the thousands of functionally uncharacterised genes in the genome - essentially nothing is known about its biological role. There are no known genetic variants in any of these genes that could explain the striking evidence for recent selection.
That's the beauty of unbiased genome-wide scans: you don't need to have a hypothesis to find something interesting. The data from this study will serve to guide further downstream analyses exploring the function of the genes in human biology and recent adaptive change.
Power to detect recent selection is still far from complete
Most genome scans for positive natural selection work by looking for unusually strong patterns of association between genetic variants stretching over a long region of the genome. These patterns of association (called linkage disequilibrium) tend to decay over time through the process of recombination. That means that you can use the length of the region of strong association as an indirect measure of how old a variant is; if you find something at high frequency that looks very young, it must have increased in frequency very rapidly and recently.
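To make the "length of the region of strong association" idea concrete, here's a minimal sketch of extended haplotype homozygosity (EHH), the quantity that haplotype-based tests of this kind build on. EHH is the chance that two randomly chosen chromosomes carrying the same core variant are identical across every marker between the core and some more distant position. The function name and the toy haplotypes are my own inventions for illustration; real implementations also handle phasing, genetic map distances and normalisation, none of which appear here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """Extended haplotype homozygosity between two marker positions.

    haplotypes: sequences of alleles (e.g. strings of '0'/'1'), all of which
                carry the same allele at the core position.
    core, end:  marker indices; the window runs between them, inclusive.
    Returns the fraction of chromosome pairs identical across the window.
    """
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = min(core, end), max(core, end)
    # Count how many chromosomes share each distinct haplotype in the window
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(n, 2)


if __name__ == "__main__":
    # Toy data: four chromosomes carrying the '1' allele at marker 0
    carriers = ["1111", "1111", "1110", "1100"]
    for position in range(4):
        print(position, ehh(carriers, core=0, end=position))
```

In the toy data, homozygosity is complete next to the core and falls away as recombination has shuffled the more distant markers. A young, rapidly risen variant keeps high EHH over a much longer stretch than an old variant at the same frequency.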
There are two possible explanations for a variant increasing in frequency very rapidly. The boring explanation is pure chance: random genetic drift, facilitated by demographic changes like population bottlenecks. The more interesting explanation is that the variant increased the reproductive fitness of the individuals that carried it, and thus increased in frequency through positive natural selection.
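The two explanations can be compared with a toy Wright-Fisher simulation, sketched below. This is a deliberately simple model (constant population size, selection acting per allele copy, no recombination or demography), not the simulation framework used in the paper, and the function name and every parameter value are assumptions chosen for illustration.

```python
import random


def wright_fisher(n_chrom, p0, s, generations, seed=None):
    """Simulate the frequency of one allele under drift plus selection.

    n_chrom:     number of chromosomes in the population (held constant)
    p0:          starting allele frequency
    s:           selective advantage per copy (0 means pure drift)
    generations: number of generations to simulate
    Returns the list of frequencies, starting with p0.
    """
    rng = random.Random(seed)
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Selection step: favoured copies are over-represented among parents
        p_sel = p * (1 + s) / (1 + p * s)
        # Drift step: each chromosome in the next generation is a random draw
        count = sum(rng.random() < p_sel for _ in range(n_chrom))
        p = count / n_chrom
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    # Hypothetical settings: 2,000 chromosomes, variant starting at 5%
    neutral = wright_fisher(2000, 0.05, s=0.0, generations=200, seed=1)
    selected = wright_fisher(2000, 0.05, s=0.01, generations=200, seed=1)
    print("neutral final frequency: ", neutral[-1])
    print("selected final frequency:", selected[-1])
```

Running it with many different seeds shows the core problem: drift alone occasionally produces a rapid rise, so a scan has to ask how often a pattern as extreme as the observed one turns up without selection.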
One of the nice things about this study is that the authors have explicitly examined the power of their algorithms to discriminate selection from the random noise of genetic drift. Here's a figure from the Supplementary Data based on some complex simulations to estimate the power of their two linkage-based methods to detect positive selection:
These two methods are the integrated haplotype score (iHS; top) and cross-population extended haplotype homozygosity (XP-EHH; bottom) tests. The authors have simulated the power of these tests to detect positive selection on a variant with a selective advantage of 1% in three populations: West Africans (YRI), Europeans (CEU) and East Asians (ASN), for genetic variants at various frequencies in these populations (frequency is the horizontal axis).
There's a lot that could be said about these graphs, but I'll just make two points: (1) the tests are nicely complementary, with iHS having maximum power for variants at around 70% frequency whereas XP-EHH is well-powered for very high-frequency variants; and (2) even so, there are a lot of positively selected variants that these tests would miss. In East Asia and Europe, for instance, both tests would miss a large majority of selected variants with a current frequency below 50%. That means that extremely recently selected variants in these populations (which are still at a low frequency) would be essentially invisible to these tests.
This problem is especially acute for populations that have been subject to very strong recent bottlenecks (e.g. Native Americans), where the noise arising from the bottleneck can largely confound signals of selection.
All this means that there are a lot of signals of selection out there yet to be found. Increasing sample sizes and exploring more varied populations will help a little, but will bring diminishing returns; for low-frequency selected variants there may well be no feasible way to distinguish them from background noise.
Possibly the most successful strategy will be combining signals from these types of scans with functional information to detect clustering of weak signals in particular biological pathways; this study uses this type of approach to find a compelling signature of selection acting on the NRG-ERBB4 pathway in non-African populations.
Anyway, I gather that a second paper on the same data-set is also awaiting publication, which will have more juicy data to explore. I'll also be following the dialogue between John Hawks and the authors of this paper with some interest.
As promised above, here's the expanded signal-sharing chart for Bantu-speaking Africans from the paper's supplementary data; the extraordinarily low degree of sharing (even with the other African cluster, Biaka Pygmies) is readily apparent:
Under your "We don't understand the function of most genes under selection" section, you don't mention that these non-protein coding genes are mir-99a, let-7c and mir-125b-2. The microRNA, let-7, is incredibly well conserved within metazoa (e.g. see Rfam or miRBase). It seems very likely to me that the positive selection signal could be due to the miRNAs rather than some dodgy gene of unknown function. However, I'm notoriously biased when it comes to protein vs ncRNA issues.