Stefánsson says he is "convinced that the reported association between exceptional longevity and most of the 33" variants found in the Science study, including all the variants that other scientists hadn't already found, "is due to genotyping problems." He has one more piece of evidence. Given what he knows about the 610-Quad, he says he can reverse-engineer the math in the BU study and estimate what fraction of the centenarians were analyzed with that chip. His estimate is about 8 percent. The actual fraction, which wasn't initially provided in the Science paper, is 10 percent, the BU researchers tell NEWSWEEK. That's close, given that Stefánsson's calculations look at just two of the variants found in the study and there may be similar problems with others.
Still, one has to wonder how the paper wound up in Science, which, along with Nature, is the top basic-science journal in the world. Most laypeople would never catch a possible technical glitch like this--who reads the methods sections of papers this complicated, much less the supplemental material, where a lot of the clues to this mystery were?--but Science's reviewers should have. It's clear that the journal--which hasn't yet responded to the concerns raised here--was excited to publish the paper, because it held a press conference last week and sent a representative to say as much.
Excellent post. I am afraid that what the public "wants" to hear ("We can find out if you will live to 100") trumps good science or good medicine, further leaving everyone with a bad taste in their mouth about GWAS-developed tests for clinical use.
I can't believe this wasn't picked up in Science... It would have been in NEJM.
Daniel, thanks for the excellent commentary.
"If the key results from this paper do turn out to be based on easily-detected experimental artifacts, Science deserves to be embarrassed."
I think that Science should be embarrassed whatever the outcome. There are so many issues with this paper, the latest being your revealing Manhattan plot, that it should never have been accepted without further verification. Especially by Science. Even if these results turn out to be correct, if they keep on accepting papers with this level of uncertainty, a large proportion of them will be false.
Lucid and useful commentary!
Then there's that 77% figure.
(1) Results from sophistimacated multi-locus models like those used by Sebastiani et al. will be even more susceptible to artefact caused by subtle (or not so subtle) measurement error at many loci. This is already a major worry with the rather simplistic "polygenic analyses" like those in last year's Nature paper on schizophrenia that simply sum up risk alleles. The model used here is far more flexible, which is a double edged sword--sure you might find some bizarre multi-locus combination that increased your chance of long life, on the off chance that there is one, but you're also much more likely to find odd artefacts, which surely do exist.
(2) Even taking their model as true, that 77% number is meaningless outside the context of this study. That number depends on how many people in your sample have the trait you are trying to predict. The replication sample was about 50% centenarians. So 77% sounds impressive, but the appropriate baseline is about 50%, which is how accurate Paul the octopus would be. Plus that number has no meaning for somebody thinking about their chance of living a long time. I was glad to see Kari and David jump on this point in the Newsweek article.
Yes, if true [a big if], the fact that a model using these markers does better than guessing at random shows that these markers are in combination associated with longevity--but that's all it shows. Folks have to be very careful with language about "predictive accuracy" because there are so many different metrics, most of which are not immediately applicable in the context of personalized medicine, which is very likely how they will get picked up in the popular press, as Sebastiani et al. have learned. I've heard them on local radio arguing against using this signature to predict one's probability of living to 100, but that may be trying to close the barn door after the horses are out.
Excellent post. From the looks of things, any reviewer with the slightest familiarity with GWAS would have instantaneously seen the problems with the data even with a superficial look at the figures. This suggests to a high degree of likelihood that Science editorial staff failed miserably in getting appropriate reviewers for this paper.
The analysis proceeded using a Bayesian model which built on the most significant SNP, rs1036819, which was wrong. See the supplementary material, page 10 (page #9): http://www.sciencemag.org/cgi/data/science.1190532/DC1/1
That's probably highly significant because of the genotyping artifact. And many of the other SNPs identified probably are as well.
It's a bit like setting out from Nashville on a trip to NYC by going West on I-40...with a broken GPS.
The plot you referenced: what platform was that WTCCC data generated on?
The WTCCC used Affymetrix chips (the old GeneChip 500K Mapping Array Set, to be precise).
Excellent commentary, as always. I've edited the post to point people to your discussion.
Good post. I agree with Peter Kraft in that the "77% accuracy" of the paper is misleading, but I disagree somewhat with his explanation of why. The authors claim 77% specificity and sensitivity and hence 77% accuracy (probably as defined by http://en.wikipedia.org/wiki/Specificity_(statistics) ). One of the useful features of these statistics is that they are all independent of penetrance (here the probability of reaching 100), e.g. sensitivity is the fraction of cases correctly predicted.
The misleading bit is that when you have a low penetrance (probability of reaching 100 is probably between 1/1000 and 1/10,000) then having specificity of 77% is not particularly useful. Assuming a penetrance of 1/1000, my calculations say your chance of reaching 100 will grow to 3.3/1000 if you are predicted to be longevious (the positive predictive value) and fall to 0.3/1000 if you are not (1-the negative predictive value). Not many people, apart from statisticians, would call that 77% accurate.
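The arithmetic in the comment above can be checked with Bayes' rule. The following is a minimal sketch, assuming the commenter's figures (penetrance of 1/1000, sensitivity and specificity both 77%); the function name `predictive_values` is made up for illustration and is not from the paper.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, 1 - NPV) for a binary test via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)                # P(reach 100 | predicted long-lived)
    risk_if_negative = false_neg / (false_neg + true_neg)  # P(reach 100 | predicted not) = 1 - NPV
    return ppv, risk_if_negative


# Assumed inputs, taken from the comment: penetrance 1/1000, 77% sensitivity and specificity.
ppv, risk_if_negative = predictive_values(prevalence=0.001,
                                          sensitivity=0.77,
                                          specificity=0.77)
print(f"Predicted long-lived: {ppv * 1000:.1f} per 1000")
print(f"Predicted not:        {risk_if_negative * 1000:.1f} per 1000")
```

With these inputs the two figures come out at roughly 3.3 and 0.3 per 1000, matching the numbers quoted above.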
Thanks Daniel G, I had missed the fact that what they call "accuracy" is the sensitivity and specificity of their test. I'm used to the convention where accuracy is defined rather intuitively as the marginal %age of predictions that are correct--i.e. if you randomly select somebody from the population (blind to their outcome and their genotype), accuracy is the probability their predicted outcome matches their true outcome. Sensitivity and specificity are typically referred to as measures of "discrimination"--a way of summarizing how test results differ between those with the outcome and those without.
As you point out, having seemingly high sensitivity and specificity does not guarantee a high probability of having the outcome if you test positive. This is a general property of conditional probabilities, discussed here: http://opinionator.blogs.nytimes.com/2010/04/25/chances-are/.
To be fair to me ;-) it is rather odd to report sensitivity (the probability somebody with the outcome tests positive) and specificity (the chance somebody without tests negative) as one number. They need not be the same--and in fact they rarely are (they are different in the longevity training set, for example). It was rather serendipitous--and ultimately a little confusing--that they could be reported as a single number.
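To make the link between the two conventions concrete: accuracy in the marginal sense described above is a prevalence-weighted average of sensitivity and specificity. The sketch below illustrates that identity. The 90%/60% pair is a hypothetical example, not a figure from the paper.

```python
def accuracy(prevalence, sensitivity, specificity):
    """Marginal probability that a prediction is correct."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


# When sensitivity equals specificity, accuracy is the same at any prevalence.
equal_balanced = accuracy(0.5, 0.77, 0.77)
equal_rare = accuracy(0.001, 0.77, 0.77)

# Hypothetical unequal pair: accuracy now shifts with prevalence.
unequal_balanced = accuracy(0.5, 0.90, 0.60)
unequal_rare = accuracy(0.001, 0.90, 0.60)

print(f"{equal_balanced:.3f} {equal_rare:.3f}")
print(f"{unequal_balanced:.3f} {unequal_rare:.3f}")
```

This is why a single "77%" only works as both a discrimination measure and an accuracy when the two rates happen to coincide.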
The other odd thing here is sensitivity and specificity make sense for binary tests, where one tests positive or negative. Fancy genetic risk algorithms like the one discussed here don't return a binary prediction: they return a continuous probability, somewhere between 0 and 1. Sebastiani et al. used an arbitrary threshold to define positive and negative tests (predicted risk higher than the proportion of cases in the sample was a positive test--using this threshold Paul the octopus' risk model would have 43% sensitivity and 57% specificity). But there are other thresholds (anywhere from 0% to 100%), and each will give a different sensitivity and specificity. This is why the discriminatory ability of a test is usually presented as a receiver operating characteristic or ROC curve (plot of sensitivity versus 1-specificity over the range of possible thresholds--available for the longevity model in the supplementary materials). The area under this curve (aka the C or concordance statistic) is often used as a one-number summary of the discriminatory ability of a test. This is the figure 23andMe could not replicate.
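As a rough illustration of how the threshold drives sensitivity and specificity, and how the C statistic summarizes across thresholds, here is a minimal sketch on a made-up toy data set (eight hypothetical individuals; none of these scores come from the longevity study). The C statistic is computed as the fraction of case/control pairs in which the case has the higher predicted risk, with ties counted as one half.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when 'score > threshold' counts as a positive test."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s > threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s <= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s <= threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


def concordance(scores, labels):
    """C statistic: P(random case scores higher than random control), ties count 1/2."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted risks and true outcomes (1 = centenarian, 0 = control).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

case_fraction = sum(labels) / len(labels)  # threshold rule described above: 0.5 here
at_case_fraction = sens_spec(scores, labels, case_fraction)
at_stricter = sens_spec(scores, labels, 0.65)
c_stat = concordance(scores, labels)

print(at_case_fraction, at_stricter, c_stat)
```

Moving the threshold from 0.5 to 0.65 trades sensitivity for specificity on the same scores, while the C statistic stays fixed because it does not depend on any single threshold.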
Although it is something of a workhorse, there has been a lot of discussion of the shortcomings of the ROC curve in the biostatistics literature lately as a way of summarizing a predictive medical test. (This discussion is not restricted to genetic tests.) For example, since the C-statistic averages over all possible thresholds, it weights what may be implausible decision rules like "treat everybody" or "treat nobody" equally with more plausible strategies (only treat folks where the benefits of treatment outweigh the risks). Of course what constitutes plausible depends on context--those risks, benefits and costs.
To reach 100, after many years of looking and feeling younger, faster and stronger, do what many healthy centenarians do--eat a low-fat diet with lots of grains and legumes. All seeds contain the simple glucose isomer Inositol, which activates the same genes as long-term caloric restriction (J Barger, 2008). My father reached 99 on oat porridge. Check out anti-ageing among your friends and acquaintances--it is happening right now, in low-fat vegetarians, and also in omnivores who do whole-grain breakfasts. These folks look years younger than their age, and have unusual energy and endurance (Inositol activates the master gene for mitochondrial biogenesis and cellular energy production, PGC 1 alpha). It's as simple as that.
Very instructive post, Daniel.
The first thing that struck me was that single-locus results were downplayed, and the paper appeared to zoom in on aggregate analysis. I'd always thought that single-locus SNP associations must be replicated exactly before any form of multivariate / multi-SNP scoring is performed.
A Manhattan plot from an ILMN GWAS chip would be more convincing. Although it is a classical platform, we know the Affy 500K chipset, used by WTCCC, has highly correlated SNPs in some regions while missing some other regions. Assays on the ILMN chips are expected to be more evenly distributed and less correlated.
Did anyone notice the new notice added to the beginning of this paper?
"Following publication of their paper in Science Express, Sebastiani et al. were made aware of an inherent defect in the 610-Quad chip that they used to genotype 7% of their discovery set (60 of 801 samples) and 17% of their replication set (44 of 254 samples). This defect may have led to incorrect genotyping of some of the SNPs that Sebastiani et al. used to build their genetic classification model for exceptional longevity. The authors are reanalyzing their data to determine the extent to which the genotyping errors affect their classification model.
The Abstract has been edited for clarity. Sentence 3 in the original Abstract has been replaced with “Using these data, we built a genetic classification model that is based on 150 single-nucleotide polymorphisms (SNPs). When we applied this model to an independent set of centenarians and control individuals of average longevity (AL), we found that it correctly classified individuals into the EL or AL group 77% of the time.”"
Yes, that's been there since just after this controversy erupted (but hasn't been commented on publicly, so thanks for adding it to the thread).
One can't help but wonder, after three months, just how much longer the reanalysis will take - but rest assured there are several people who are actively pushing to get hold of the final results...