[Previous installments: here, here, here, here]
We'd like to continue this series on randomized versus observational studies by discussing randomization, but upon reviewing comments and our previous post we decided to come at it from a slightly different direction. So we want to circle back and discuss counterfactuals a little more, clarifying and adapting some of what we said for the coming randomization discussion.
Let me change the example to a more recent controversy, screening mammography for breast cancer. Should women under 40 get routine screening given that there is said (on the basis of RCTs) to be small benefit on the one hand and on the other, putative or real risks? I don't want to get into the actual details of this dispute but just look at the logic without concerning ourselves with the size of the benefit. Let's simplify to asking if there is any benefit to mammography at all. This is like the question about our blood pressure drug: does it work? I'm changing the example because I think what is involved is easier to see when we talk about screening.
Consider what this means in the case of Jane Doe, who is considering whether to get a mammogram. She wants to know if it will prolong her life in quantity and quality. Let's do a thought experiment. The ideal experiment would be this. We screen Jane Doe, follow her for a number of years and then use an appropriate measure of outcome, say age at death (we are including not only breast cancer death but risk of death from any cause). Then we turn the clock back, don't screen Jane Doe, do the same thing, and compare the two outcomes. If they are different, most of us would say that's what it means for "mammography to work."
In the real world, though, only one of these scenarios can happen. The thing that gives it meaning for causality, the comparison with the impossible anti-trial, can't happen. This is a conundrum. We need a work around. It won't be perfect but it's the best we can do. We will claim that there isn't just one work around but many and you use what you can in terms of feasibility, ethics and resources.
So what are the work arounds for what seems an insoluble situation? Jane Doe can't be both screened and not screened. They are mutually exclusive. For screening, unlike our blood pressure example, there is no possibility of using a cross-over design, i.e., first screening her (or not screening her) to see what happens for a few years and then not screening her (or screening her) and then seeing what happens for several more years. Her risk changes with time so the later Jane Doe is clearly not equivalent to the earlier one. We don't have an identical Jane Doe. Even if she had an identical twin, Alice Doe, Jane and Alice will have the same genome but different histories, making them differ in many ways (e.g., one might be a radiologist and the other an accountant; one might live in Denver, the other in Charleston, SC). Not even their genetics will be identical because the genome becomes modified after birth. These "epigenetic" changes are like changing the Preferences on a software program. The underlying program is the same but two users might set things up quite differently after opening the shrinkwrap. The problem with a whole bunch of Jane Does (a population) is no different than one Jane Doe. We can't both screen them and not screen them. At this point many of you will want to have two populations, one screened and one unscreened and compare them. That's obviously where we are heading, but before we do, let's stay with the counterfactual problem just a bit longer.
What if we could turn the clock back for a population of Jane Does? What would we look at? We want something that measures the relevant (for our purposes) differences between the screened and unscreened population. Epidemiologists are adept at finding these measures. It might be total mortality after a suitable follow-up period (incidence proportion), survival after screening, breast cancer mortality per person year of observation (incidence density), etc. Which one we choose may be subject and setting specific and let's not worry about which one we settle on. We are only interested in the difference in the measure between the screened and unscreened population after turning the clock back. That difference is called a causal contrast (or effect measure(!) or causal parameter). Say we are using the arithmetic difference as the causal contrast. Let's choose risk of dying from breast cancer and call the risk when the population is screened R1, and R0 the risk when it isn't screened.
We can observe only one of these, however, because in the real world we can't turn the clock back. While we want to measure (R1 - R0), we can only observe one of R1 or R0, not both together, so we can't measure what we want, the causal contrast (the measure of effect). Faced with this, we try to do the next best thing: find a substitute for the unobservable counterfactual population (the one that was unscreened). So far we haven't said anything about randomization and for good reasons. This is a general framework for all kinds of studies about whether something works or causes disease (etiologic studies). In experimental situations, like clinical trials, the investigator gets to assign the treatment (screened or not screened, drug or no drug). In observational studies the assignments are given to us or the contrast is with some prior experience (e.g., Rind's HIV or rabies examples mentioned in our last post or our fictitious blood pressure trial). Study design differences come down to how we choose the original population for the study question (the Jane Does), how we choose an appropriate substitute population for the causal contrast (the pseudo Jane Does), how well these decisions represent the world of Jane Does, and then all sorts of ancillary decisions related to costs, study time and other technical factors.
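To make the unobservable causal contrast concrete, here is a toy simulation (all risk numbers are invented for illustration) of a population of Jane Does where we pretend, godlike, that we can see both potential outcomes for every woman, the thing the real world never lets us do:

```python
import random

random.seed(1)

# Hypothetical sketch: for each simulated woman we know BOTH potential
# outcomes -- death from breast cancer if not screened (y0) and if
# screened (y1). The 3% baseline risk and the 20% of deaths averted by
# screening are made-up numbers, assumed purely for illustration.
population = []
for _ in range(100_000):
    y0 = 1 if random.random() < 0.030 else 0   # dies if NOT screened
    # If she would have died unscreened, screening averts the death
    # with probability 0.2 (assumed); otherwise screening changes nothing.
    y1 = 0 if (y0 == 1 and random.random() < 0.2) else y0
    population.append((y1, y0))

R1 = sum(y1 for y1, _ in population) / len(population)  # risk if all screened
R0 = sum(y0 for _, y0 in population) / len(population)  # risk if none screened
causal_contrast = R1 - R0   # the quantity we want but can never observe

print(f"R1 = {R1:.4f}, R0 = {R0:.4f}, causal contrast = {causal_contrast:.4f}")
```

In reality each woman contributes only one of y1 or y0, which is exactly why a substitute population is needed for the other.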
The sampling problem is an added complication. The study group of screened Jane Does is meant to represent the world of Jane Does for whom we want to know if screening is a good thing. The way we have chosen them might or might not be representative -- and here that means "good stand-ins" -- for that larger group. The same for the unscreened pseudo Jane Does. Let's call the larger group the target population. So there is a double substitution going on here. One is the pseudo Jane Does (unscreened) for the real Jane Does (screened). The other is both of these populations for the bigger external target population. None of this (so far) involves randomization. It just involves the notion of substitution of one population for another so we can observe something unobservable (in one case the entire population, in the other the counterfactual), thus allowing us to get a causal contrast. If either or both of these substitutes don't represent the target population or the counterfactual pair, then we run the risk of misreading the contrast. Epidemiologists call [the non comparability of the counterfactual substitute] confounding (or we say the causal contrast is confounded; this is a more general notion of confounding than seen in many textbooks but amounts to the same thing). What it means in plain language is that when we used an imperfect substitute we weren't really getting an accurate picture of what we would have seen if we'd been able to turn the clock back. We aren't seeing what we want to (but can't) see. Our "work around" was faulty.
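A small sketch of how a poorly chosen counterfactual substitute confounds the contrast (all risks and prevalences below are invented assumptions, not real mammography numbers):

```python
import random

random.seed(2)

def simulate(n, p_high_risk, screened):
    # Each woman is high- or low-risk (think: family history). By
    # assumption here, screening cuts breast-cancer death risk by 20%.
    deaths = 0
    for _ in range(n):
        base = 0.06 if random.random() < p_high_risk else 0.02
        risk = base * 0.8 if screened else base
        deaths += random.random() < risk
    return deaths / n

# The screened Jane Does: 30% are high-risk.
R_screened = simulate(100_000, 0.30, screened=True)

# A good counterfactual substitute: unscreened, same 30% high-risk.
R_good_sub = simulate(100_000, 0.30, screened=False)

# A confounded substitute: unscreened, but 60% high-risk (say, women
# drawn from a referral clinic rather than the same source population).
R_bad_sub = simulate(100_000, 0.60, screened=False)

print("contrast with comparable substitute:", R_screened - R_good_sub)
print("contrast with confounded substitute:", R_screened - R_bad_sub)
```

The confounded substitute exaggerates the apparent benefit of screening, not because screening works better but because the stand-in population was sicker to begin with. That is the "faulty work around" in miniature.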
Let's consider our blood pressure trial (see previous posts here, here, here, here). The substitute for the same patient but without treatment by the drug was in fact the same patient at a previous time when they were being treated with the usual therapies and not responding. This would seem to be a pretty good substitute, although one can conjure up reasons why there might be confounding (see the comment threads and also the post by David Rind in Evidence in Medicine). Most of those problems could probably be remedied with a technical fix for this unblinded single-arm trial and wouldn't require randomization. What is the target population? It could be all refractory hypertensives or just refractory hypertensives in this doctor's practice or none at all, just a report of what happened to these patients. So the double substitution is visible here, too.
A randomized clinical trial (RCT) is another kind of "work around" for the counterfactual problem. You still have to worry about how good the substitute for the counterfactual is and how representative a substitute the study population is for the target population. Thus when you study seasonal flu vaccine effectiveness in the elderly with an RCT, they are a substitute for the elderly in general. If you extend that to swine flu vaccine in the young, you are changing the nature of the substitute. That may or may not be a reasonable thing to do. It requires justification. You have to check what the target population is and that the study population provides a fair substitute for it. Thus even a pristine RCT isn't a pure gold standard, but an alloyed gold standard. You always have to check how much base metal there is.
Although randomization doesn't deal with the target population substitution, there is a reasonable expectation that it will make the counterfactual substitute (the pseudo Jane Does for the screened Jane Does) roughly comparable, i.e., that the substitute will be a good one. Even here, however, complications arise independent of randomization gone bad (i.e., that by odd chance the two groups will not be comparable on some factor of importance). This requires a longer discussion, though, and is best delayed to the next installment. How many installments will there be? I have no idea. I am just following the argument as it goes. I'm surprised it has taken this many.
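A quick simulation of why randomization tends to make the two groups comparable on a risk factor, even one never measured, since the coin flip pays no attention to it (the 30% prevalence is an arbitrary illustrative choice):

```python
import random

random.seed(3)

# 10,000 simulated women, each carrying a binary risk factor with
# assumed prevalence 30%. Randomize each to arm A or B by coin flip
# and check how comparable the arms end up on that factor.
women = [random.random() < 0.30 for _ in range(10_000)]
arm_a, arm_b = [], []
for high_risk in women:
    (arm_a if random.random() < 0.5 else arm_b).append(high_risk)

prev_a = sum(arm_a) / len(arm_a)
prev_b = sum(arm_b) / len(arm_b)
print(f"high-risk prevalence: arm A {prev_a:.3f}, arm B {prev_b:.3f}")
# With samples this large the prevalences come out close -- but only in
# expectation; any single randomization can still be unbalanced by chance,
# which is the complication deferred to the next installment.
```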
If you are interested in counterfactuals, this article is useful: Estimating causal effects. Maldonado G, Greenland S. Int J Epidemiol. 2002 Apr;31(2):422-9. It's a deeper subject than most people give it credit for. As for the next installment, you don't have to wait. The comment threads are open.
I am liking your discussion on randomization and counterfactuals very much.
Yet, it would be of help to me, and many others, to see how you would explain random versus fixed effects models. We seem to be having a real upsurge in folks who are taking samples in the environment that are not random and they honestly don't understand the difference between fixed versus random effects so they generalize (inappropriately) with fixed effects models just as they would with random effects models.
Please go into the differences in the way variances are partitioned in fixed versus random models and please be very concrete in your examples.
Dwight: First, thanks for the kind words. I'm not sure how much I want to get into the weeds on study design. I have another end in view and this has gotten unexpectedly long already. But I'll see. I'm still not sure where it is taking me.
Maldonado and Greenland mention early in the article that a relative risk is not necessarily a biological constant; this is worth emphasizing too. The half-life of a radioactive isotope is assumed to be a physical constant (otherwise the dating of a carbon-containing compound would be futile). But the "relative risk" of an "exposure" may not be an analogous biological constant. Even if the substitutions for the counterfactual were perfect, and the external validity were perfect, the estimation of the relative risk could be subject to variation, depending on the distribution of other factors in different populations.
As they mention at the end of their article, this means that meta-analysis may not be so much an estimate of a fictional "common effect" as it is a search for sources of systematic variation among study results. Making "meta-analyses with homogeneity" a "gold standard" for medical evidence could, in this reading, be a bit of a fallacy. Since the homogeneous meta-analysis has an oracle-like status in many interpretations of the hierarchy of evidence for EBM, Maldonado and Greenland's article could reframe the meaning of this hierarchy.
To me, that seems like a pretty big deal, especially when some people get all persnickety about what constitutes good evidence for a medical hypothesis.
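Ed's point can be sketched with a little arithmetic. Suppose (purely hypothetically) an exposure triples risk only in a susceptible subgroup; the crude relative risk observed in a study then depends on how common that subgroup happens to be in the study population, not on biology alone:

```python
# Hypothetical illustration: an exposure that triples a baseline risk
# only among a "susceptible" subgroup. The crude RR for the whole
# population then varies with the subgroup's prevalence.
def crude_rr(p_susceptible, base=0.01):
    risk_exposed = p_susceptible * base * 3 + (1 - p_susceptible) * base
    risk_unexposed = base
    return risk_exposed / risk_unexposed

print(crude_rr(0.10))  # 10% susceptible -> crude RR of 1.2
print(crude_rr(0.50))  # 50% susceptible -> crude RR of 2.0
```

Two perfectly conducted studies in populations with different susceptibility prevalences would report different relative risks with neither being "wrong," which is why heterogeneity across studies need not count against causation.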
Ed: I'll have to dig up my copy, but an RR isn't a "property" of an organism or even an exposure and an organism. It's the product of a comparison and depends on the reference group so could hardly be a biological constant. But maybe I'm misreading what you were saying.
As Obama says of Harry Reid, my expression was inartful.
The fact that RR depends on so many aspects of the comparison being made has consequences for the interpretation of meta-analyses. Homogeneity and consistency are not the same thing. The consistency of an association across populations should not be viewed as necessary for a causal relationship.
Generally, when meta-analysis is described to general audiences, it comes across as an attempt to pool the association measures of smaller studies into a single association measure with a smaller standard error than those of the individual studies. This is what I read the article to be rejecting when it speaks of meta-analysis as a search for sources of systematic variation rather than "an exercise in estimating a fictional common effect."
To me, this interpretation of the Maldonado and Greenland paper suggests that it is a mistake to place meta-analyses with homogeneity at the top of the hierarchy of study designs for evidence in medicine. I have heard learned people speaking as if heterogeneity somehow weakened the evidence for causation. This article suggests that this need not be true. Homogeneity may arise from pooling studies of similar populations with similar distributions of potential confounders, an artifact of various investigators' research agendas rather than of truth itself. That is worth remarking on.
Ed: Ahh. Yes. There are two uses of the term meta-analysis. One of them is indeed a pooled analysis. This is sometimes possible but not often. The other is an observational study of experimental studies, and that's usually what the Cochrane Collaboration produces. It is indeed what we will have to examine. You have discovered some of where I am going.
I liked your original post but I've been getting more and more restless since. All the discussions analyse the original study as if it were the end of the investigation. It should be the start. You have a hypothesis. You have some evidence to support the hypothesis. You don't have a clear biological mechanism. There are some problems with the design, as you correctly show there are problems with all designs. You don't have a huge effect or overwhelming statistical support. What to do? Get more data. Do more studies.
In the real world of medical research, wouldn't at least a few cardiologists in teaching hospitals somewhere in the world follow up this work? Wouldn't one try four groups (placebo, one pill a day, two pills a day, three pills a day)? Wouldn't another do a double blind randomized test comparing your treatment to alternate treatments? This is certainly what happens in other areas of scientific research.
ECaruthers: The assumption was there was a biological mechanism. That's why she tried it on her patients. So you can consider this confirmation of the biological hypothesis. How do you know there wasn't good statistical support? I didn't give you the results of the F-test. And what do you consider "good statistical support"? Should any result, whether from an RCT or an observational study, be confirmed and replicated? Of course. The question here was about the logic of this trial and we still have more to explore. If you follow up on this, who is going to pay for the follow-up? Doctors in teaching hospitals can't do this without money. It's quite expensive to do an RCT. Does it happen in other areas of scientific research? All the time? I rather doubt it. Every paleontology finding? Every botanical finding? Unlikely.
I understand that your main interest is in the logic of your test compared to other tests. But you intentionally set up your hypothetical to be suggestive but not conclusive. No one responded by saying this should become the new standard of care for similar patients. I actually agree with you in believing the results shouldn't be dismissed.
When I suggest 'do more studies' I don't only mean RCTs with clinical endpoints. I include exactly the kind of follow-up that David Rind described after the initial report of the effects of ritonavir and that quickly developed into triple therapy "cocktails" for HIV.
Your hypothetical 'magnitude of effect' is much smaller than in the ritonavir tests, so I wouldn't expect as many scientists to follow up. But every scientist I know just loves a new result in his or her area. Even a theory that the community strongly disbelieves gets tested by someone, just to say it's been disproved. See http://vosshall.rockefeller.edu/reprints/KellerVosshall2004.pdf
ECaruthers: It's possible for other docs to do the same off label thing, but actually setting up a trial, RCT or not, costs money and it has to come from somewhere. So while someone might get funded for this, it would take a while. But I agree that confirming findings is critical, although it is done much more rarely than you'd think.
Perhaps you already intend to discuss the comparative costs of different study types and experimental designs. If not, I would suggest it as an additional future topic. If randomized controlled trials of drug effects are so expensive that they are only done when funded by drug companies, then it seems to me that you have another argument for alternative types of study.