Is Statistical Significance Always Significant?
Department of Medicine, UCLA School of Medicine, Olive View-UCLA Medical Center, Sylmar, California
Correspondence: Ronald L. Koretz, M.D., Department of Medicine, Olive View-UCLA Medical Center, 14445 Olive View Dr., Sylmar, CA 91342. Electronic mail may be sent to rkoretz{at}ladhs.org.
One way in which we learn new information is to read the medical literature. Whether or not we do primary research, it is important to be able to read literature in a critical fashion. A seemingly simple concept in reading is to interpret p values. For most of us, if we find a p value that is <.05, we take the conclusion to heart and quote it at every opportunity. If the p value is >.05, we discard the paper and look elsewhere for useful information. Unfortunately, this is too simplistic an approach. The real utility of p values is to consider them within the context of the experiment being performed. Defects in study design can make an interpretation of a p value useless. One has to be wary of type I (seeing a "statistically significant" difference just because of chance) and type II (failing to see a difference that really exists) errors. Examples of the former are publication bias and the performance of multiple analyses; the latter refers to a trial that is too small to demonstrate the difference. Finding significant differences in surrogate or intermediate endpoints may not help us. We need to know if those endpoints reflect the behavior of clinical endpoints. Selectively citing significant differences and disregarding studies that do not find them is inappropriate. Small differences, even if they are statistically significant, may require too much resource expenditure to be clinically useful. This article explores these problems in depth and attempts to put p values in the context of studies.
The staff of your hospital has elected you to be a representative to its clinical policymaking committee. At your first meeting, a discussion is undertaken regarding whether or not to implement a policy of routinely providing preoperative nutrition support (parenteral or enteral) to patients who are scheduled for major surgery and have had modest preoperative weight losses (<10 pounds). One member of the committee submits a series of references to randomized controlled trials (RCTs) that have compared nutrition support with no nutrition support in surgical patients and have almost uniformly shown that the patients who received the active therapy had better nitrogen balance.1 In addition, he notes 1 RCT indicating that clinical outcome was improved with parenteral nutrition2 and 1 showing a similar result for enteral nutrition.3 However, another member of the committee cites references to a parenteral4 and an enteral5 trial that did not show any benefit from the intervention. Your immediate impression is to decide that the former argument is more persuasive, because you believe that people can find any single paper in the literature to support their point of view but the nitrogen balance data appeared to be from a multitude of RCTs. The chairman of the committee informs everyone that, should they decide to implement this policy, they will have to reduce some other resource expenditure to pay for it, and he suggests that, as a start, the pharmacy be closed from midnight to 6 AM (in other words, during those hours, no medications will be available unless the ward nurse goes down to the pharmacy and gets the medication). What do you do?
We have become inured to the presence of p values in the clinical papers that we read. If that value is "<.05," we remember the result and quote it as gospel. If the p value is ">.05," we dismiss the finding as one that will never be of any use to us. (Some investigators even make such a distinction when considering p values of .049 and .051!) However, the use of statistical tests and results has to be put in the perspective of the methodology and viewed in a critical light. A number of misleading conclusions can occur if we just rely on the number .05. The purpose of the following discussion is to illustrate when statistically significant differences should be viewed with skepticism and when statistically insignificant differences should not be dismissed.
The first, and most obvious, issue relates to the reliability of .05 to detect truth. Strictly speaking, a p value never directly reflects truth. Rather, it provides us with the probability that a difference at least as large as the one observed would arise by chance alone. In this case, a p value <.05 tells us that, if no true difference existed, there would be a less than 5% probability of observing such a difference. It must be realized that chance observations do occur. Consider a scenario in which an investigator makes 25 independent observations in an experiment. Let us assume that there is no real difference in any. However, if that observer does statistical testing on all of them, he is likely to see some spurious p values <.05. The probability that none of them would have a difference with a p value <.05 is actually relatively low: 0.95^25, or about 28%. In other words, if this experiment is repeated on many occasions, at least 1 p value <.05 would be seen about 72% of the time.
We refer to the phenomenon of seeing a statistically significant difference in a situation where no true difference exists as a type I error. These will happen in fewer than 1 in 20 studies if only a single observation is made. As was seen in the example, the chances of it happening in any particular study increase when multiple observations are made. This same error occurs with "publication bias." There is a tendency for positive results to be more likely to appear in the literature. Consider a scenario in which 25 different investigators perform the same clinical experiment in a small number of subjects. Although no true difference exists, one of the experimenters finds a p value <.05. It is more likely that that experience will be reported. This is why, when one reads the medical literature, one has to be wary of small trials showing significant differences.
The type II error is the converse of the above. In this case, we fail to see a difference even though a true one does exist.
As an extreme example, imagine a disease that has an expected mortality of 95% and a treatment that reduces that mortality rate to 5%. If only 2 patients are entered into a randomized trial of that treatment (1 treated, 1 control), the survival rates would likely be 100% (1/1) in the treated arm and 0% (0/1) in the control arm. However, statistical analysis would show no significant difference and the treatment could be judged as being ineffective.
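The 2-patient example can be checked directly. The sketch below is an illustration, not part of the original analysis: it assumes Fisher's exact test (a common choice for small 2 x 2 tables; the article does not name the test it has in mind), and the function name `fisher_exact_two_sided` and the 20-patients-per-arm comparison are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2 x 2 table [[a, b], [c, d]].

    Rows are trial arms; columns are survived / died.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x survivors in the first arm,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    return sum(
        table_probability(x)
        for x in range(smallest, largest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )


# The article's trial: 1 treated patient survives, 1 control patient dies.
p_tiny_trial = fisher_exact_two_sided(1, 0, 0, 1)

# Hypothetical larger trial with similar rates: 19/20 vs 1/20 survive.
p_larger_trial = fisher_exact_two_sided(19, 1, 1, 19)

print(f"2-patient trial:  p = {p_tiny_trial:.3f}")
print(f"40-patient trial: p = {p_larger_trial:.2e}")
```

With 1 patient per arm, only 2 tables are possible and each is equally likely under the null hypothesis, so the p value is 1.0 no matter how dramatic the observed difference looks.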
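In order to discuss type II errors, we must be familiar with several terms, namely: null hypothesis, α, β, power, and effect size. The null hypothesis assumes that there is no difference between the 2 (or more) interventions. A clinical trial is designed to compare the interventions and use statistical tests to prove or disprove that null hypothesis. Thus the p value (also known as α) reflects the probability of a type I error. β is the probability of a type II error, and power (1 − β) is the probability of detecting a difference that truly exists. Effect size is the magnitude of the difference between the interventions. Sample size refers to the number of subjects in a study. Statisticians can calculate the sample size that would be needed to ensure a certain power if they are given the sizes of α and the effect.
Type II errors probably occur frequently in the medical literature.6 For example, several small RCTs were unable to find any difference in the clinical outcome when patients with active Crohn's disease were treated with steroids or with elemental diets. These observations led to the claim that the 2 treatments were equivalent. However, subsequent larger trials and 3 meta-analyses7–9 showed that steroid therapy was superior. High quality RCTs should provide, in the methods section, a description of a calculation of power.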
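To make the relationship among α, power, effect size, and sample size concrete, here is a minimal sketch of one textbook calculation. It assumes a comparison of 2 proportions, a two-sided α, and the normal approximation; the article does not specify a formula, and the function name `sample_size_per_arm` and the event rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate subjects needed per arm to compare two proportions.

    Normal approximation, two-sided alpha (an assumed method).
    """
    if p_control == p_treated:
        raise ValueError("effect size must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical event rates, chosen only to show how n grows as the effect shrinks.
for p_treated in (0.30, 0.40, 0.45):
    n = sample_size_per_arm(0.50, p_treated)
    print(f"50% vs {p_treated:.0%}: about {n} subjects per arm")
```

Halving the expected difference roughly quadruples the required sample size, which is why small trials of modest effects are so prone to type II errors.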
The effects that are assessed in clinical trials are known as endpoints or outcomes. In general, we are interested in clinical outcomes, namely those that are important to patients. Clinical outcomes include mortality, morbidity (including quality of life), duration of the disease process, or (at least with regard to the payer) cost. However, clinical outcomes are often difficult to assess (because they are subtle or take a long time to occur). Because of this, investigators often choose to use laboratory tests or other measurements that are easier to obtain and that (hopefully) reflect clinical outcomes. We call such effects "surrogate" or "intermediate" outcomes. Body weight, nitrogen balance, or other nutrition parameters are surrogate markers in trials of nutrition support. If an intervention changes a surrogate outcome, but does not reduce the amount of pain and suffering (or cost), then such changes do not really make any difference to the patient. Simply stated, if we help a patient achieve weight gain, but the patient dies anyway, we have only created a larger burden for the pallbearer. It is important for the critical reader to appreciate the difference between clinical and surrogate outcomes. If we are going to use surrogate outcomes, we have to know that improving them does result in improvement in 1 or more clinical outcome(s). Referring back to our dilemma, 1 of the major reasons for buying into the pro argument was the fact that a number of studies demonstrated improvements in nitrogen balance. However, when correlations between improvement in nitrogen balance and improvement in various clinical outcomes were sought, none were found.1 This was true for other nutrition parameters as well.1 On the other hand, other surrogate outcomes do correlate with clinical ones. 
For example, improvements in the viral titer or CD4 count in patients with human immunodeficiency virus infections do correlate with improved clinical outcomes.10 Thus any given surrogate outcome has to be considered individually.
As was noted in the dilemma discussion, it is not hard to look through the literature and find particular papers that will support almost any position ("cherry-picking" the literature). This can even be true if one is restricting one's argument to data from RCTs. Thus, when one is mounting an evidence-based argument, it is more helpful to survey the entire body of the medical literature and see what the composite experience is. We refer to this process as systematic reviewing. As one could imagine, this is a time-consuming undertaking and is not practical if one is making an individual patient-care decision at the bedside. It is more important to do this if one is establishing policy; the very highest level of evidence is a systematic review of RCTs. The Cochrane Collaboration regularly publishes systematic reviews of a wide variety of specific clinical questions. These reviews are available in the Cochrane Library. Abstracts of these reviews are available online (www.cochrane.org); the Cochrane Library contains the complete reviews and is usually available in medical libraries. Considering the available evidence for perioperative nutrition support, the data have been disappointing. Parenteral nutrition (with a very limited number of possible exceptions) did not have any apparent beneficial effect.11 Enteral nutrition, although perhaps less harmful than parenteral nutrition,12 also did not appear to result in any clinically meaningful benefits.13
Another advantage of systematic reviewing is the ability to combine the data from multiple RCTs to gain a more precise estimate of the true effect. Although one can do this in a very simple manner (namely just adding together the numerators and denominators), a more sophisticated method (meta-analysis) is available. Meta-analysis uses the statistical reliability (variance) of each trial to assign a weighting factor to that study before combining the data. Variance depends on the standard deviation (ie, the range in which the "true" answer is likely to lie). In general, larger trials tend to have smaller standard deviations and are weighted more heavily. Of course, no matter how this more precise estimate is obtained, one should not confuse it with absolute truth. When meta-analysis is used, the estimate is accompanied by a confidence interval, typically a 95% interval. The 95% confidence interval is constructed so that, if the experiment were done repeatedly, 95% of such intervals would contain the true effect. When the extremes of a 95% confidence interval go in the same direction (either benefit or harm), this estimate is viewed similarly to a p value <.05. The availability of a reliable estimate of effect (even if using one or the other extreme of the confidence interval) is important with regard to making policy, particularly in a resource-limited environment, because the estimate can be converted into a "number needed to treat." For example, consider a particular intervention that increases the absolute incidence of a beneficial outcome by 5%. No single patient ever has a fraction of an outcome; rather, the probability of having the outcome is increased by 5%. Thus, on average, 1 patient out of 20 who are given the intervention will have the beneficial outcome that would not have occurred had the intervention not been provided. In this case, 20 is the number needed to treat to achieve 1 beneficial outcome.
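The weighting and the number needed to treat described above can be sketched in a few lines. This is a hedged illustration: it assumes fixed-effect inverse-variance weighting, which is one standard way of doing what the text describes but is not necessarily the method used in any review cited here. The 3 trial results are invented for the example, and the function names are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors, z=1.96):
    """Inverse-variance pooled estimate with an approximate 95% interval."""
    weights = [1 / se ** 2 for se in standard_errors]  # precise trials count more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    return pooled, (pooled - z * pooled_se, pooled + z * pooled_se)


def number_needed_to_treat(absolute_benefit):
    """Patients who must be treated, on average, for 1 extra good outcome."""
    if absolute_benefit <= 0:
        raise ValueError("no benefit, so no finite number needed to treat")
    return 1 / absolute_benefit


# Invented absolute risk differences and standard errors from 3 trials.
differences = [0.08, 0.03, 0.05]
std_errors = [0.05, 0.02, 0.03]

pooled, (low, high) = pool_fixed_effect(differences, std_errors)
print(f"pooled benefit {pooled:.3f}, 95% CI {low:.3f} to {high:.3f}")
print(f"NNT at the pooled estimate: {number_needed_to_treat(pooled):.0f}")

# The article's own example: a 5% absolute benefit gives an NNT of 20.
print(f"NNT for a 5% benefit: {number_needed_to_treat(0.05):.0f}")
```

Running the conversion at each end of the confidence interval as well as at the point estimate gives a committee a best case and a worst case to budget against.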
Let us consider an example of how a number needed to treat could be used to establish policy. Let us assume that 7 days of preoperative parenteral nutrition reduces the incidence of superficial postoperative wound infections by 5%. Under such a circumstance, we would have to invest the cost of hospitalization of 20 patients for 1 week plus the cost of the parenteral nutrition for all of them in order to avoid the cost of treating 1 wound infection. It would be cheaper just to treat the 1 infection. With regard to our dilemma, being able to estimate the effect of the intervention allows decision makers to predict more precisely how much the pharmacy would have to be shut down (or other cost-savings introduced) to compensate for the additional resource expenditure of the intervention. In the medical literature, certain costs have been accepted as standards for what society should be willing to pay. That number is $25,000–$50,000 for every quality-adjusted life-year obtained.14,15 A quality-adjusted life year takes into account the fact that time spent with symptoms is not the same as time spent symptom-free; if a symptom makes life only half as good, each year of survival with that symptom is only worth 0.5 quality-adjusted life years.16 Realistically, $25,000–$50,000 per quality-adjusted life year is more than society can afford. Let us just consider life-years without adjusting for quality (because quality adjustment makes the costs even higher). The gross domestic product (GDP) of the United States is about $11 trillion, and there are about 280,000,000 people living here. This means that, on average, each person produces about $40,000 worth of goods and services each year; this is the upper limit of what we can spend (if we spend it all on health care) without going into debt. If we are going to spend 15% of the GDP on health care, the actual number we should be using is about $6000 per life-year.
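The closing arithmetic is easy to reproduce. The sketch below uses only the figures quoted in the paragraph above (GDP of about $11 trillion, about 280,000,000 people, 15% of GDP spent on health care); the function name `affordable_spend_per_life_year` is hypothetical.

```python
def affordable_spend_per_life_year(gdp, population, health_share):
    """Return (output per person per year, health care share of that output)."""
    output_per_person = gdp / population
    return output_per_person, output_per_person * health_share


output_per_person, health_budget = affordable_spend_per_life_year(
    gdp=11e12,          # about $11 trillion, as quoted in the text
    population=280e6,   # about 280,000,000 people
    health_share=0.15,  # 15% of GDP spent on health care
)

print(f"Output per person per year: ${output_per_person:,.0f}")
print(f"Health care share per life-year: ${health_budget:,.0f}")
```

Both results round to the article's figures of about $40,000 and about $6000.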
Another place to be wary of finding "statistical significance," although it is not relevant to the dilemma as presented, is the observation of significant differences when subgroups are analyzed. Several problems can be introduced. One obvious trap is to perform a large number of such analyses. Consider a situation in which an investigator performs a trial comparing a particular intervention to no treatment and is unable to find a difference. He or she then decides to look at subgroups of patients based on age, gender, ethnicity, eye color, left- or right-handedness, and the presence or absence of several comorbidities; in the process, 25 such analyses are performed. From our previous discussion, it would not be surprising if at least 1 of these comparisons were associated with a p value <.05. However, this may very well be a statistical quirk rather than a true difference. In a variant of this scenario, let us say that the investigator is looking at the data and notes that a number of people with a specific characteristic (eg, younger age) had particularly good outcomes. He or she then sets out to do a formal subgroup analysis. However, the p value representing the probability of a difference being due to chance has no meaning if the difference was subjected to analysis because it looked different at the outset. Another problem that has occurred is evaluating subgroups by a characteristic that cannot be prospectively identified. Consider an actual example: investigators performed a large RCT comparing specialized enteral nutrient formulations to standard ones.17 At the end of the trial, no significant differences were seen with regard to length of stay or infection rates. When the analysis was performed only in the subgroup of patients who had received the highest percentage of the intended intake of nutrients, significant differences were observed, leading to the recommendation that the specialized formulation should be used.
However, there is a logical conundrum here. If there was no difference overall, but a beneficial effect in 1 subgroup, there has to be a harmful effect in another. If we cannot prospectively identify the subgroup that will have the benefit, we have to provide the intervention to all of the patients, thereby doing as much harm as good. (The investigators could not ascertain, before the enteral nutrition was undertaken, which patients would tolerate the infusions.) Subgroup analyses should be especially viewed with skepticism if there was no a priori plan to perform them, or if nonprospectively identifiable subgroups are being assessed. At best, such analyses can only be "hypothesis-generating," not "hypothesis-proving." If there are subgroups for which the investigator believes an effect may be reasonably likely to occur, these subgroup analyses should be planned at the beginning of the study. In fact, in RCTs, these subgroups should be separated out (stratified) and independently randomized.
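The multiple-comparison arithmetic behind the 25-subgroup scenario can be sketched as follows. The sketch assumes the analyses are independent and that no true difference exists in any of them, as the text's thought experiment does; the function name `chance_of_spurious_result` is hypothetical.

```python
def chance_of_spurious_result(n_tests, alpha=0.05):
    """Probability of at least 1 p value below alpha among n independent
    tests when no true difference exists in any of them."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 25):
    risk = chance_of_spurious_result(n_tests)
    print(f"{n_tests:>2} analyses: {risk:.0%} chance of at least 1 p value <.05")
```

Real subgroups usually overlap and so are not fully independent, which changes the exact figure but not the lesson: the more unplanned analyses are run, the less a lone p value <.05 means.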
There are several situations in which the demonstration of a statistically significant difference may not, or will not, translate into a clinically meaningful outcome. These are noted in Table 1. As Gertrude Stein allegedly said, "For a difference to be a difference, it has to make a difference."
The best way to resolve this problem is to identify the most reliable estimate of the effect and then decide if that outcome is worth the resource investment. When policy is being made (as opposed to making individual clinical decisions at the bedside), it is not only justifiable, but even necessary, to search the literature for the most reliable data. Those data can then be combined (by simple addition or meta-analysis) to obtain the estimate. At times, a systematic review may already be available; the Cochrane Library is one place to begin to look. If a systematic review is not at hand, you (along with the other committee members) should undertake the task. Once the effect is known, an estimated resource cost can usually be easily identified. At that point, you are in a good position to decide if the resource investment is worthwhile.
Nutrition in Clinical Practice, Vol. 20, No. 3, 303-307 (2005)