Do the risk benefit assignment using the attached information. Remember to answer ALL of the questions! I know this may be a difficult assignment, but it will go quickly if you set up the formulas and put in the numbers.
Go back to the Powerpoint and use the formulas provided. It is relatively straightforward if you use the formulas.
Please show your work rather than just the end number so I can better assess how you did it.
Assignment Week 2, Risk Benefit
Age-, gender- and weight-matched patients were treated with lorcaserin, a new drug for weight
loss, or placebo, in conjunction with a diet and exercise program. The tables below summarize
results from different trials. A newer drug, semaglutide, was also studied.
Table 1 shows weight loss results for patients who completed the trial.
Outcome | Placebo | Lorcaserin
Weight loss >10% body weight | 243 | 748
Total participants completing trial | 5083 | 5135

Outcome | Placebo | Semaglutide
Weight loss >10% body weight | 12 | 68
Total participants completing trial | 655 | 1306
Table 2 shows some adverse events.
Adverse event | Placebo (lorcaserin trial) | Lorcaserin | Placebo (semaglutide trial) | Semaglutide
Headache | 15 | 37 | 81 | 198
Nausea | 19 | 35 | 114 | 544
Suicidal ideation | 11 | 21 | 83 | 124
Total participants | 5992 | 5995 | 655 | 1306
Table 3 shows outcomes for cardiovascular events for lorcaserin and semaglutide, another
new weight loss drug.
Outcome | Placebo | Lorcaserin
Cardiovascular event | 369 | 364
Total | 6000 | 6000

Outcome | Placebo | Semaglutide
Cardiovascular event | 70 | 107
Total | 655 | 1306
Analyze the benefits and risks of lorcaserin and semaglutide compared to placebo.
In particular, answer the following questions:
1. What is the odds ratio for lorcaserin producing greater than 10% weight loss?
2. What is the odds ratio for semaglutide producing greater than 10% weight loss?
3. What is the Relative risk for each drug for:
Lorcaserin
Semaglutide
a. Headache RR=
b. Nausea RR=
c. Suicidal ideation RR=
4. What is the relative risk or relative risk reduction for major cardiovascular events for
each drug?
Lorcaserin
Semaglutide
RRR=
5. Do you think the benefits of lorcaserin outweigh the risks given that the odds ratio for
myocardial infarction is 1.44 for a patient with a BMI over 30, which all of these patients
had at the beginning of the study?
6. What about the patients treated with semaglutide, who also had a BMI over 30 at the
beginning of the study?
7. Which drug would you choose, or do you think neither has a good enough risk benefit
ratio to be used?
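The ratios asked for in questions 1 to 4 can be computed directly from the raw counts in each table. The sketch below is one minimal way to set that up in Python. It assumes the standard definitions (odds = events ÷ non-events; risk = events ÷ total), since the odds ratio formula is given in the PowerPoint rather than in the attached article. The function names and the example counts are illustrative only and are not taken from Tables 1 to 3.

```python
def odds_ratio(events_treated, total_treated, events_control, total_control):
    """Odds of the event in the treated group divided by odds in the control group."""
    odds_treated = events_treated / (total_treated - events_treated)
    odds_control = events_control / (total_control - events_control)
    return odds_treated / odds_control


def relative_risk(events_treated, total_treated, events_control, total_control):
    """Event rate in the treated group divided by event rate in the control group."""
    return (events_treated / total_treated) / (events_control / total_control)


# Hypothetical counts: 30 events among 100 treated, 40 events among 100 controls.
rr = relative_risk(30, 100, 40, 100)
odds = odds_ratio(30, 100, 40, 100)
rrr = 1 - rr  # relative risk reduction, meaningful when rr < 1

print(f"RR = {rr:.2f}, RRR = {rrr:.2f}, OR = {odds:.2f}")
```

Substituting a pair of columns from one of the tables gives the corresponding ratio. Showing the intermediate event rates and odds alongside the final number makes the work easy to follow.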
CMAJ 2005: Tips for Learners of Evidence-Based Medicine: A 5-Part Series
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, Moyer V, Guyatt G. Tips for
learners of evidence-based medicine: 1. relative risk reduction, absolute risk reduction
and number needed to treat. Can Med Assoc J 2004; 171:353–358.
Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, Guyatt G. Tips for
learners of evidence-based medicine: 2. measures of precision (confidence intervals).
Can Med Assoc J 2004; 171:611–615.
McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G. Tips for learners of
evidence-based medicine: 3. measures of observer variability (kappa statistic). Can
Med Assoc J 2004; 171:1369–1373.
Hatala R, Keitz S, Wyer P, Guyatt G. Tips for learners of evidence-based medicine: 4.
assessing heterogeneity of primary studies in systematic reviews and whether to
combine their results. Can Med Assoc J 2005;172:661–665.
Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G. Tips for learners of evidence-based medicine: 5. the effect of spectrum of disease on the performance of diagnostic
tests. Can Med Assoc J 2005;172:385–390.
Review
Synthèse
Tips for learners of evidence-based medicine:
1. Relative risk reduction, absolute risk reduction
and number needed to treat
Alexandra Barratt, Peter C. Wyer, Rose Hatala, Thomas McGinn, Antonio L. Dans, Sheri Keitz,
Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group
See related article page 347
Physicians, patients and policy-makers are influenced
not only by the results of studies but also by how authors present the results.1–4 Depending on which
measures of effect authors choose, the impact of an intervention may appear very large or quite small, even though
the underlying data are the same. In this article we present
3 measures of effect — relative risk reduction, absolute risk
reduction and number needed to treat — in a fashion designed to help clinicians understand and use them. We
have organized the article as a series of “tips” or exercises.
This means that you, the reader, will have to do some work
in the course of reading this article (we are assuming that
most readers are practitioners, as opposed to researchers
and educators).
The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians.5,6 A related article, intended
for people who teach these concepts to clinicians, is available
online at www.cmaj.ca/cgi/content/full/171/4/353/DC1.
Clinician learners’ objectives
DOI:10.1503/cmaj.1021197
Understanding risk and risk reduction
• Learn how to determine control and treatment event
rates in published studies.
• Learn how to determine relative and absolute risk reductions from published studies.
• Understand how relative and absolute risk reductions
usually apply to different populations.
Balancing benefits and adverse effects in individual
patients
• Learn how to use a known relative risk reduction to estimate the risk of an event for a patient undergoing
treatment, given an estimate of that patient’s risk of the
event without treatment.
• Learn how to use absolute risk reductions to assess
whether the benefits of therapy outweigh its harms.
Calculating and using number needed to treat
• Develop an understanding of the concept of number
needed to treat (NNT) and how it is calculated.
• Learn how to interpret the NNT and develop an understanding of how the “threshold NNT” varies depending on the patient’s values and preferences, the
severity of possible outcomes and the adverse effects
(harms) of therapy.
Tip 1: Understanding risk and risk reduction
You can calculate relative and absolute risk reductions using simple mathematical formulas (see Appendix 1). However, you might find it easier to understand the concepts
through visual presentation. Fig. 1A presents data from a hypothetical trial of a new drug for acute myocardial infarction,
showing the 30-day mortality rate in a group of patients at
high risk for the adverse event (e.g., elderly patients with
congestive heart failure and anterior wall infarction). On the
basis of information in Fig. 1A, how would you describe the effect of the new drug? (Hint: Consider the event rates in people not taking the new drug and those who are taking it.)
We can describe the difference in mortality (event) rates in both relative and absolute terms. In this case, these high-risk patients had a relative risk reduction of 25% and an absolute risk reduction of 10%.
Now, let’s consider Fig. 1B, which shows the results of a second hypothetical trial of the same new drug, but in a patient population with a lower risk for the outcome (e.g., younger patients with uncomplicated inferior wall myocardial infarction). Looking at Fig. 1B, how would you describe the effect of the new drug?
The relative risk reduction with the new drug remains at 25%, but the event rate is lower in both groups, and hence the absolute risk reduction is only 2.5%.
Although the relative risk reduction might be similar across different risk groups (a safe assumption in many if not most cases7,8), the absolute gains, represented by absolute risk reductions, are not. In sum, the absolute risk reduction becomes smaller when event rates are low, whereas the relative risk reduction, or “efficacy” of the treatment, often remains constant.
These phenomena may be factors in the design of drug trials. For example, a drug may be tested in severely affected people in whom the absolute risk reduction is likely to be impressive, but is subsequently marketed for use by less severely affected patients, in whom the absolute risk reduction will be substantially less.
The bottom line
Relative risk reduction is often more impressive than absolute risk reduction. Furthermore, the lower the event rate in the control group, the larger the difference between relative risk reduction and absolute risk reduction.
Risk and risk reduction: definitions
Event rate: the number of people experiencing an event as a proportion of the number of people in the population
Relative risk reduction: the difference in event rates between 2 groups, expressed as a proportion of the event rate in the untreated group; usually constant across populations with different risks7,8
Absolute risk reduction: the arithmetic difference between 2 event rates; varies with the underlying risk of an event in the individual patient
Teachers of evidence-based medicine:
See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.
CMAJ • AUG. 17, 2004; 171 (4)
© 2004 Canadian Medical Association or its licensors
[Figure 1: two bar-chart panels, A and B. Vertical axis: risk for outcome of interest, % (0 to 40). Bars: placebo and treatment.]
Panel A (Trial 1: high-risk patients). Among high-risk patients in trial 1, the event rate in the control group (placebo) is 40 per 100 patients, and the event rate in the treatment group is 30 per 100 patients. Absolute risk reduction (also called the risk difference) is the simple difference in the event rates (40% – 30% = 10%). Relative risk reduction is the difference between the event rates in relative terms. Here, the event rate in the treatment group is 25% less than the event rate in the control group (i.e., the 10% absolute difference expressed as a proportion of the control rate is 10/40 or 25% less).
Panel B (Trial 1: high-risk patients; Trial 2: low-risk patients). Among low-risk patients in trial 2, the event rate in the control group (placebo) is only 10%. If the treatment is just as effective in these low-risk patients, what event rate can we expect in the treatment group? The event rate in the treated group would be 25% less than in the control group or 7.5%. Therefore, the absolute risk reduction for the low-risk patients (second pair of columns) is only 2.5%, even though the relative risk reduction is the same as for the high-risk patients (first pair of columns).
Fig. 1: Results of hypothetical placebo-controlled trials of a new drug for acute myocardial infarction. The bars represent the 30-day mortality rate in different groups of patients with acute myocardial infarction and heart failure. A: Trial involving patients at
high risk for the adverse outcome. B: Trials involving a group of patients at high risk for the adverse outcome and another group of
patients at low risk for the adverse outcome.
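The arithmetic behind Fig. 1 can be sketched in a few lines of Python. This is only an illustration of the point made in Tip 1, assuming (as the article does) that the same 25% relative risk reduction applies in both trials; the function names are made up for this example.

```python
def treated_event_rate(control_rate, rrr):
    """Event rate expected on treatment, given the control rate and a relative risk reduction."""
    return control_rate * (1 - rrr)


def absolute_risk_reduction(control_rate, treatment_rate):
    """Simple difference between the control and treatment event rates."""
    return control_rate - treatment_rate


RRR = 0.25  # assumed constant across risk groups, as in Fig. 1

results = {}
for label, control in [("Trial 1 (high risk)", 0.40), ("Trial 2 (low risk)", 0.10)]:
    treated = treated_event_rate(control, RRR)
    arr = absolute_risk_reduction(control, treated)
    results[label] = (treated, arr)
    print(f"{label}: control {control:.1%}, treated {treated:.1%}, ARR {arr:.1%}")
```

The relative effect is identical in both rows, yet the absolute risk reduction in the low-risk trial is a quarter of that in the high-risk trial, because the control event rate is a quarter as large.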
Tip 2: Balancing benefits and adverse effects
in individual patients
In prescribing medications or other treatments, physicians consider both the potential benefits and the potential
harms. We have just demonstrated that the benefits of
treatment (presented as absolute risk reductions) will generally be greater in patients at higher risk of adverse outcomes than in patients at lower risk of adverse outcomes.
You must now incorporate the possibility of harm into
your decision-making.
First, you need to quantify the potential benefits. Assume you are managing 2 patients for high blood pressure
and are considering the use of a new antihypertensive drug,
drug X, for which the relative risk reduction for stroke over
3 years is 33%, according to published randomized controlled trials.
Pat is a 69-year-old woman whose blood pressure during a routine examination is 170/100 mm Hg; her blood
pressure remains unchanged when you see her again 3
weeks later. She is otherwise well and has no history of cardiovascular or cerebrovascular disease. You assess her risk
of stroke at about 1% (or 1 per 100) per year.9
Dorothy is also 69 years of age, and her blood pressure
is the same as Pat’s, 170/100 mm Hg; however, because she
had a stroke recently, you assess her risk of subsequent
stroke as higher than Pat’s, perhaps 10% per year.10
One way of determining the potential benefit of a new
treatment is to complete a benefit table such as Table 1A.
To do this, insert your estimated 3-year event rates for Pat
and Dorothy, and then apply the relative risk reduction
(33%) expected if they take drug X. It is clear from Table
1A that the absolute risk reduction for patients at higher
risk (such as Dorothy) is much greater than for those at
lower risk (such as Pat).
Now, you need to factor the potential harms (adverse effects associated with using the drug) into the clinical decision. In the clinical trials of drug X, the risk of severe gastric bleeding increased 3-fold over 3 years in patients who
received the drug (relative risk of 3). A population-based
study has reported the risk of severe gastric bleeding for
women in your patients’ age group at about 0.1% per year
(regardless of their risk of stroke). These data can now be
added to the table to allow a more balanced assessment of
the benefits and harms that could arise from treatment
(Table 1B).
Considering the results of this process, would you give
drug X to Pat, to Dorothy or to both?
In making your decisions, remember that there is not
necessarily one “right answer” here. Your analysis might go
something like this:
Pat will experience a small benefit (absolute risk reduction over 3 years of about 1%), but this will be considerably
offset by the increased risk of gastric bleeding (absolute risk
increase over 3 years of 0.6%). The potential benefit for
Dorothy (absolute risk reduction over 3 years of about 10%)
is much greater than the increased risk of harm (absolute
risk increase over 3 years of 0.6%). Therefore, the benefit of
treatment is likely to be greater for Dorothy (who is at
higher risk of stroke) than for Pat (who is at lower risk).
Assessment of the balance between benefits and harms
depends on the value that patients place on reducing their
risk of stroke in relation to the increased risk of gastric
bleeding. Many patients might be much more concerned
about the former than the latter.
Table 1A: Benefit table*
Patient group | 3-yr event rate for stroke, no treatment, % | 3-yr event rate for stroke, with treatment (drug X), % | Absolute risk reduction (no treatment – treatment), %
At lower risk (e.g., Pat) | 3 | 2 | 1
At higher risk (e.g., Dorothy) | 30 | 20 | 10
*Based on data from a randomized controlled trial of drug X, which reported a 33% relative risk reduction for the outcome
(stroke) over 3 years.
Table 1B: Benefit and harm table*
Patient group | Stroke, 3-yr event rate, no treatment, % | Stroke, 3-yr event rate, with treatment (drug X), % | Absolute risk reduction (no treatment – treatment), % | Severe gastric bleeding, 3-yr event rate, no treatment, % | Severe gastric bleeding, 3-yr event rate, with treatment (drug X), % | Absolute risk increase (treatment – no treatment), %
At lower risk (e.g., Pat) | 3 | 2 | 1 | 0.3 | 0.9 | 0.6
At higher risk (e.g., Dorothy) | 30 | 20 | 10 | 0.3 | 0.9 | 0.6
*Based on data from randomized controlled trials of drug X reporting a 33% relative risk reduction for the outcome (stroke) over 3 years and a 3-fold increase for the adverse effect
(severe gastric bleeding) over the same period.
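The benefit and harm table can be rebuilt from its inputs, which may help when setting up a similar table for another drug. The following Python sketch assumes, as the article does, that the 3-year baseline risk is simply 3 times the annual risk, and it uses the reported 33% relative risk reduction and 3-fold relative risk of bleeding. The function and variable names are hypothetical.

```python
def benefit_harm_row(stroke_no_tx_pct, stroke_rrr, bleed_no_tx_pct, bleed_rr):
    """Return treated event rates (%) with the absolute risk reduction and increase."""
    stroke_tx_pct = stroke_no_tx_pct * (1 - stroke_rrr)  # apply the relative risk reduction
    bleed_tx_pct = bleed_no_tx_pct * bleed_rr            # apply the relative risk of harm
    return {
        "stroke_tx": stroke_tx_pct,
        "arr": stroke_no_tx_pct - stroke_tx_pct,
        "bleed_tx": bleed_tx_pct,
        "ari": bleed_tx_pct - bleed_no_tx_pct,
    }


YEARS = 3
annual_stroke_risk_pct = {"Pat": 1.0, "Dorothy": 10.0}
annual_bleed_risk_pct = 0.1

rows = {}
for name, annual in annual_stroke_risk_pct.items():
    rows[name] = benefit_harm_row(annual * YEARS, 0.33, annual_bleed_risk_pct * YEARS, 3)
    row = rows[name]
    print(f"{name}: ARR {row['arr']:.2f}%, ARI {row['ari']:.2f}%")
```

The unrounded absolute risk reductions come out slightly below the 1% and 10% shown in the table, because the table rounds the treated stroke rates to 2% and 20%.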
Number needed to treat: definitions
Number needed to treat: the number of patients who
would have to receive the treatment for 1 of them to
benefit; calculated as 100 divided by the absolute risk
reduction expressed as a percentage (or 1 divided by the
absolute risk reduction expressed as a proportion; see
Appendix 1)
Number needed to harm: the number of patients who
would have to receive the treatment for 1 of them to
experience an adverse effect; calculated as 100 divided
by the absolute risk increase expressed as a percentage
(or 1 divided by the absolute risk increase expressed as a
proportion)
The bottom line
When available, trial data regarding relative risk reductions (or increases), combined with estimates of baseline
(untreated) risk in individual patients, provide the basis for
clinicians to balance the benefits and harms of therapy for
their patients.
Tip 3: Calculating and using number needed
to treat
Some physicians use another measure of risk and benefit, the number needed to treat (NNT), in considering the
consequences of treating or not treating. The NNT is the
number of patients to whom a clinician would need to administer a particular treatment to prevent 1 patient from
having an adverse outcome over a predefined period of
time. (It also reflects the likelihood that a particular patient
to whom treatment is administered will benefit from it.) If,
for example, the NNT for a treatment is 10, the practitioner would have to give the treatment to 10 patients to
prevent 1 patient from having the adverse outcome over
the defined period, and each patient who received the treatment would have a 1 in 10 chance of being a beneficiary.
If the absolute risk reduction is large, you need to treat
only a small number of patients to observe a benefit in at
least some of them. Conversely, if the absolute risk reduction is small, you must treat many people to observe a benefit in just a few.
An analogous calculation to the one used to determine
the NNT can be used to determine the number of patients
who would have to be treated for 1 patient to experience an
adverse event. This is the number needed to harm (NNH),
which is the inverse of the absolute risk increase.
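As a rough illustration, the NNT and NNH for the two patients in Tip 2 can be computed with one small helper. The inputs below are the rounded absolute risk reduction and absolute risk increase from Table 1B, expressed as percentages; the helper name is an assumption made for this sketch.

```python
def number_needed(abs_risk_change_pct):
    """NNT (from an absolute risk reduction) or NNH (from an absolute risk increase), in %."""
    if abs_risk_change_pct <= 0:
        raise ValueError("absolute risk change must be positive")
    return 100 / abs_risk_change_pct


nnt_pat = number_needed(1)       # stroke ARR of about 1% over 3 years
nnt_dorothy = number_needed(10)  # stroke ARR of about 10% over 3 years
nnh = number_needed(0.6)         # gastric bleeding ARI of 0.6% over 3 years

print(f"Pat: NNT {nnt_pat:.0f}; Dorothy: NNT {nnt_dorothy:.0f}; NNH {nnh:.0f} for either patient")
```

Read this way, about 10 patients like Dorothy would need treatment to prevent 1 stroke, compared with about 100 patients like Pat, while the number treated per additional severe bleed is the same for both.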
How comfortable are you with estimating the NNT
for a given treatment? For example, consider the following questions: How many 60-year-old patients with hypertension would you have to treat with diuretics for a period of 5 years to prevent 1 death? How many people with
myocardial infarction would you have to treat with β-blockers for 2 years to prevent 1 death? How many people
with acute myocardial infarction would you have to treat
with streptokinase to prevent 1 person from dying in the
next 5 weeks? Compare your answers with estimates derived from published studies (Table 2). How accurate
were your estimates? Are you surprised by the size of the
NNT values?
Physicians often experience problems in this type of
exercise, usually because they are unfamiliar with the calculation of NNT. Here is one way to think about it. If a
disease has a mortality rate of 100% without treatment
and therapy reduces that mortality rate to 50%, how
many people would you need to treat to prevent 1 death?
From the numbers given, you can probably figure out that
treating 100 patients with the otherwise fatal disease results in 50 survivors. This is equivalent to 1 out of every 2
treated. Since all were destined to die, the NNT to prevent 1 death is 2. The formula reflected in this calculation
is as follows: the NNT to prevent 1 adverse outcome
equals the inverse of the absolute risk reduction. Table 3
illustrates this concept further. Note that, if the absolute risk reduction is presented as a percentage, the NNT is 100/absolute risk reduction; if the absolute risk reduction is expressed as a proportion, the NNT is 1/absolute risk reduction. Both methods give the same answer, so use whichever you find easier.
Table 2: Benefit table for patients with cardiovascular problems
Clinical question | Control group event rate, % | Treatment group event rate, % | ARR, % | NNT
What is the reduction in risk of stroke within 5 years among 60-year-old patients with hypertension who are treated with diuretics?11 | 2.9 | 1.9 | 1.00 | 100
What is the reduction in risk of death within 2 years after MI among 60-year-old patients treated with β-blockers?12 | 9.8 | 7.3 | 2.50 | 40
What is the reduction in risk of death within 5 weeks after acute MI among 60-year-old patients treated with streptokinase?13 | 12.0 | 9.2 | 2.80 | 36
Note: MI = myocardial infarction, ARR = absolute risk reduction, NNT = number needed to treat.
Table 3: Calculation of NNT from absolute risk reduction*
Form of absolute risk reduction | Calculation of NNT | Example
Percentage (e.g., 2.8%) | 100/ARR | 100/2.8 = 36
Proportion (e.g., 0.028) | 1/ARR | 1/0.028 = 36
*Using absolute risk reduction in last row of Table 2.13
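The NNT column of Table 2 can be checked from the event rates using either form of the calculation. The short Python sketch below does this for the three rows; the row labels are abbreviated paraphrases of the clinical questions, and rounding to the nearest whole patient is an assumption chosen to match the published values.

```python
# (label, control event rate %, treatment event rate %) taken from Table 2
table2 = [
    ("Diuretics, stroke within 5 years", 2.9, 1.9),
    ("Beta-blockers, death within 2 years after MI", 9.8, 7.3),
    ("Streptokinase, death within 5 weeks after acute MI", 12.0, 9.2),
]

nnts = []
for label, control_pct, treatment_pct in table2:
    arr_pct = control_pct - treatment_pct      # ARR as a percentage
    arr_proportion = arr_pct / 100             # the same ARR as a proportion
    nnt_from_percentage = 100 / arr_pct
    nnt_from_proportion = 1 / arr_proportion
    # Both forms describe the same quantity, so they agree up to floating-point error.
    agree = abs(nnt_from_percentage - nnt_from_proportion) < 1e-9
    nnts.append(round(nnt_from_percentage))
    print(f"{label}: ARR {arr_pct:.2f}%, NNT {nnts[-1]} (forms agree: {agree})")
```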
It can be challenging for clinicians to estimate the baseline risks for specific populations. For example, some physicians may have little idea of the risk of stroke over 5 years
among patients with hypertension. Physicians may also
overestimate the effect of treatment, which leads them to
ascribe larger absolute risk reductions and smaller NNT
values than are actually the case.14
Now that you know how to determine the NNT from
the absolute risk reduction, you must also consider whether
the NNT is reasonable. In other words, what is the maximum NNT that you and your patients will accept as justifying the benefits and harms of therapy? This is referred to
as the threshold NNT.15 If the calculated NNT is above
the threshold, the benefits are not large enough (or the risk
of harm is too great) to warrant initiating the therapy.
Determinants of the threshold NNT include the patient’s own values and preferences, the severity of the outcome that would be prevented, and the costs and side effects of the intervention. Thus, the threshold NNT will
almost certainly be different for different patients, and
there is no simple answer to the question of when an NNT
is sufficiently low to justify initiating treatment.
The bottom line
NNT is a concise, clinically useful presentation of the
effect of an intervention. You can easily calculate it from
the absolute risk reduction (just remember to check
whether the absolute risk reduction is presented as a percentage or a proportion and use a numerator of 100 or 1
accordingly). Be careful not to overestimate the effect of
treatments (i.e., use a value of absolute risk reduction that is
too high) and thus underestimate the NNT.
Conclusions
Clinicians seeking to apply clinical evidence to the care
of individual patients need to understand and be able to
calculate relative risk reduction, absolute risk reduction
and NNT from data presented in clinical trials and systematic reviews. We have described and defined these
concepts and presented tabular tools and equations to
help clinicians overcome common pitfalls in acquiring
these skills.
This article has been peer reviewed.
From the School of Public Health, University of Sydney, Sydney, Australia (Barratt); the Columbia University College of Physicians and Surgeons, New York, NY
(Wyer); the Department of Medicine, University of British Columbia, Vancouver,
BC (Hatala); Mount Sinai Medical Center, New York, NY (McGinn); the Department of Internal Medicine, University of the Philippines College of Medicine,
Manila, The Philippines (Dans); Durham Veterans Affairs Medical Center and
Duke University Medical Center, Durham, NC (Keitz); the Department of Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)
Competing interests: None declared.
Contributors: Alexandra Barratt contributed tip 2, drafted the manuscript, coordinated input from coauthors and reviewers and from field-testing and revised all
drafts. Peter Wyer edited drafts and provided guidance in developing the final format. Rose Hatala contributed tip 1, coordinated the internal review process and
provided comments throughout development of the manuscript. Thomas McGinn
contributed tip 3 and provided comments throughout development of the manuscript. Antonio Dans reviewed all drafts and provided comments throughout development of the manuscript. Sheri Keitz conducted field-testing of the tips and contributed material from the field-testing to the manuscript. Virginia Moyer
reviewed and contributed to the final version of the manuscript. Gordon Guyatt
helped to write the manuscript (as an editor and coauthor).
References
1. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing effect of relative and absolute risk. J Gen Intern Med 1993;8:543-8.
2. Forrow L, Taylor WC, Arnold RM. Absolutely relative: How research results
are summarized can affect treatment decisions. Am J Med 1992;92:121-4.
3. Naylor CD, Chen E, Strauss B. Measured enthusiasm: Does the method of
reporting trial results alter perceptions of therapeutic effectiveness? Ann Intern Med 1992;117:916-21.
4. Fahey T, Griffiths S, Peters TJ. Evidence based purchasing: understanding
results of clinical trials and systematic reviews. BMJ 1995;311:1056-60.
5. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures of association. In: Guyatt G, Rennie D, editors. The users’ guides to the
medical literature: a manual of evidence-based clinical practice. Chicago: AMA
Publications; 2002. p. 351-68.
6. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips
for learning and teaching evidence-based medicine: introduction to the series.
CMAJ 2004;171(4):347-8.
7. Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study of the
effect of the control rate as a predictor of treatment efficacy in meta-analysis
of clinical trials. Stat Med 1998;17:1923-42.
8. Furukawa TA, Guyatt GH, Griffith LE. Can we individualise the number
needed to treat? An empirical study of summary effect measures in meta-analyses. Int J Epidemiol 2002;31:72-6.
9. SHEP Cooperative Research Group. Prevention of stroke by anti-hypertensive drug treatment in older persons with isolated systolic hypertension. Final
results of the Systolic Hypertension in the Elderly Program (SHEP). JAMA
1991;265:3255-64.
10. SALT Collaborative Group. Swedish Aspirin Low-dose Trial (SALT) of
75mg aspirin as secondary prophylaxis after cerebrovascular events. Lancet
1991;338:1345-9.
11. Psaty BM, Smith NL, Siscovick DS, Koepsell TD, Weiss NS, Heckbert
SR. Health outcomes associated with antihypertensive therapies used as
first-line agents. A systematic review and meta-analysis. JAMA 1997;277:
739-45.
12. β-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction. I. Mortality results.
JAMA 1982;247:1707-14.
13. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase,
oral aspirin, both or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988;2:349-60.
14. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number
needed to treat: a clinically useful nomogram in its proper context. BMJ 1996;
312:426-9.
15. Sinclair JC, Cook RJ, Guyatt GH, Pauker SG, Cook DJ. When should an effective treatment be used? Derivation of the threshold number needed to treat
and the minimum event rate for treatment. J Clin Epidemiol 2001;54:253-62.
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,
Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net
Members of the Evidence-Based Medicine Teaching Tips
Working Group: Peter C. Wyer (project director), Columbia
University College of Physicians and Surgeons, New York, NY;
Deborah Cook, Gordon Guyatt (general editor), Ted Haines,
Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose
Hatala (internal review coordinator), Department of Medicine,
University of British Columbia, Vancouver, BC; Robert Hayward
(editor, online version), Bruce Fisher, University of Alberta,
Edmonton, Alta.; Sheri Keitz (field-test coordinator), Durham
Veterans Affairs Medical Center and Duke University, Durham,
NC; Alexandra Barratt, University of Sydney, Sydney, Australia;
Pamela Charney, Albert Einstein College of Medicine, Bronx, NY;
Antonio L. Dans, University of the Philippines College of
Medicine, Manila, The Philippines; Barnet Eskin, Morristown
Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory
University, Atlanta, Ga.; Hui Lee, formerly Group Health Centre,
Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas
McGinn, Mount Sinai Medical Center, New York, NY; Victor M.
Montori, Department of Medicine, Mayo Clinic College of
Medicine, Rochester, Minn.; Virginia Moyer, University of Texas,
Houston, Tex.; Thomas B. Newman, University of California, San
Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa,
Ont.; W. Scott Richardson, Wright State University, Dayton,
Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa
Appendix 1: Formulas for commonly used measures of therapeutic effect
Measure of effect | Formula
Relative risk | (Event rate in intervention group) ÷ (event rate in control group)
Relative risk reduction | 1 – relative risk, or (absolute risk reduction) ÷ (event rate in control group)
Absolute risk reduction | (Event rate in control group) – (event rate in intervention group)
Number needed to treat | 1 ÷ (absolute risk reduction)
Review
Synthèse
Tips for learners of evidence-based medicine:
2. Measures of precision (confidence intervals)
Victor M. Montori, Jennifer Kleinbart, Thomas B. Newman, Sheri Keitz, Peter C. Wyer,
Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group
DOI:10.1503/cmaj.1031667
In the first article in this series,1 we presented an approach to understanding how to estimate a treatment’s
effectiveness that covered relative risk reduction, absolute risk reduction and number needed to treat. But how
precise are these estimates of treatment effect?
In reading the results of clinical trials, clinicians often
come across 2 related but different statistical measures of an
estimate’s precision: p values and confidence intervals. The p
value describes how often apparent differences in treatment
effect that are as large as or larger than those observed in a
particular trial will occur in a long run of identical trials if in
fact no true effect exists. If the observed differences are sufficiently unlikely to occur by chance alone, investigators reject the hypothesis that there is no effect. For example, consider a randomized trial comparing diuretics with placebo
that finds a 25% relative risk reduction for stroke with a p
value of 0.04. This p value means that, if diuretics were in
fact no different in effectiveness than placebo, we would expect, by the play of chance alone, to observe a reduction —
or increase — in relative risk of 25% or more in 4 out of
100 identical trials.
Although they are useful for investigators planning how
large a study needs to be to demonstrate a particular magnitude of effect, p values fail to provide clinicians and patients with the information they most need, i.e., the range
of values within which the true effect is likely to reside.
However, confidence intervals provide exactly that information in a form that pertains directly to the process of deciding whether to administer a therapy to patients. If the
range of possible true effects encompassed by the confidence interval is overly wide, the clinician may choose to
administer the therapy only selectively or not at all.
Confidence intervals are therefore the topic of this article. For a nontechnical explanation of p values and their
limitations, we refer interested readers to the Users’ Guides
to the Medical Literature.2
As with the first article in this series,1 we present the information as a series of “tips” or exercises. This means that
you, the reader, will have to do some work in the course of
reading the article. The tips we present here have been
adapted from approaches developed by educators experienced in teaching evidence-based medicine skills to clinicians.2-4 A related article, intended for people who teach
these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/6/611/DC1.
Clinician learners’ objectives
Making confidence intervals intuitive
• Understand the dynamic relation between confidence
intervals and sample size.
Interpreting confidence intervals
• Understand how the confidence intervals around estimates of treatment effect can affect therapeutic decisions.
Estimating confidence intervals for extreme
proportions
• Learn a shortcut for estimating the upper limit of the
95% confidence intervals for proportions with very
small numerators and for proportions with numerators
very close to the corresponding denominators.
Tip 1: Making confidence intervals intuitive
Imagine a hypothetical series of 5 trials (of equal duration but different sample sizes) in which investigators have
experimented with treatments for patients who have a particular condition (elevated low-density lipoprotein cholesterol) to determine whether a drug (a novel cholesterol-lowering agent) would work better than a placebo to
prevent strokes (Table 1A). The smallest trial enrolled only 8 patients, and the largest enrolled 2000 patients, and half
of the patients in each trial underwent the experimental
treatment. Now imagine that all of the trials showed a relative risk reduction for the treatment group of 50% (meaning that patients in the drug treatment group were only half
as likely as those in the placebo group to have a stroke). In
each individual trial, how confident can we be that the true
value of the relative risk reduction is important for patients
(i.e., “patient-important”)?5 If you were to look at the studies individually, which ones would lead you to recommend
the treatment unequivocally to your patients?
Teachers of evidence-based medicine:
See the “Tips for teachers” version of this article online
at www.cmaj.ca/cgi/content/full/171/6/611/DC1. It
contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the
challenges they encounter when teaching these concepts
to clinician learners and links to useful online resources.
CMAJ • SEPT. 14, 2004; 171 (6)
© 2004 Canadian Medical Association or its licensors
Most clinicians might intuitively guess that we could be
more confident in the results of the larger trials. Why is this?
In the absence of bias or systematic error, the results of a trial
can be interpreted as an estimate of the true magnitude of effect that would occur if all possible eligible patients had been
included. When only a few of these patients are included, the
play of chance alone may lead to a result that is quite different from the true value. Confidence intervals are a numeric
measure of the range within which such variation is likely to
occur. The 95% confidence intervals that we often see in
biomedical publications represent the range within which we
are likely to find the underlying true treatment effect.
To gain a better appreciation of confidence intervals, go
back to Table 1A (don’t look yet at Table 1B!) and take a
guess at what you think the confidence intervals might be
for the 5 trials presented. In a moment you’ll see how your
Table 1A: Relative risk and relative risk reduction observed in 5 successively larger hypothetical trials

Control event rate   Treatment event rate   Relative risk, %   Relative risk reduction, %*
2/4                  1/4                    50                 50
10/20                5/20                   50                 50
20/40                10/40                  50                 50
50/100               25/100                 50                 50
500/1000             250/1000               50                 50

*Calculated as the absolute difference between the control and treatment event rates
(expressed as a fraction or a percentage), divided by the control event rate. In the first row
in this table, relative risk reduction = (2/4 – 1/4) ÷ 2/4 = 1/2 or 50%. If the control event
rate were 3/4 and the treatment event rate 1/4, the relative risk reduction would be
(3/4 – 1/4) ÷ 3/4 = 2/3. Using percentages for the same example, if the control event rate
were 75% and the treatment event rate were 25%, the relative risk reduction would be
(75% – 25%) ÷ 75% = 67%.
estimates compare to 95% confidence intervals calculated
using a formula, but for now, try figuring out intervals that
you intuitively feel to be appropriate.
Now, consider the first trial, in which 2 out of 4 patients
who receive the control intervention and 1 out of 4 patients
who receive the experimental treatment suffer a stroke.
The risk in the treatment group is half that in the control
group, which gives us a relative risk of 50% and a relative
risk reduction of 50% (see Table 1A).1,6
Given the substantial relative risk reduction, would you
be ready to recommend this treatment to a patient? Before
you answer this question, consider whether it is plausible,
with so few patients in the study, that the investigators might
just have gotten lucky and the true treatment effect is really a
50% increase in relative risk. In other words, is it plausible
that the true event rate in the group that received treatment
was 3 out of 4 instead of 1 out of 4? If you accept that this
large, harmful effect might represent the underlying truth,
would you also accept that a relative risk reduction of 90%,
i.e., a very large benefit of treatment, is consistent with the
experimental data in these few patients? To the extent that
these suggestions are plausible, we can intuitively create a
range of plausible truth of “-50% to 90%” surrounding the
relative risk reduction of 50% that was actually observed.
Now, do this for each of the other 4 trials. In the trial with
20 patients in each group, 10 of those in the control group
suffered a stroke, as did 5 of those in the treatment group.
Both the relative risk and the relative risk reduction are again
50%. Do you still consider it plausible that the true event rate
in the treatment group is 15 out of 20 rather than 5 out of 20
(the same proportions as we considered in the smaller trial)?
If not, what about 12 out of 20? The latter would represent a
20% increase in risk over the control rate (12/20 v. 10/20). A
true relative risk reduction of 90% may still be plausible,
given the observed results and the numbers of patients involved. In short, given this larger number of patients and the
lower chance of a “bad sample,” the “range of plausible truth”
around the observed relative risk reduction of 50% might be
narrower, perhaps from a relative risk increase of 20% (represented as –20%) to a relative risk reduction of 90%.
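The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration of the formula given in the footnote to Table 1A; the function names are our own and are not part of the article.

```python
from fractions import Fraction


def relative_risk(control_events, control_n, treatment_events, treatment_n):
    """Risk in the treatment group divided by risk in the control group."""
    control_rate = Fraction(control_events, control_n)
    treatment_rate = Fraction(treatment_events, treatment_n)
    return treatment_rate / control_rate


def relative_risk_reduction(control_events, control_n, treatment_events, treatment_n):
    """(control rate - treatment rate) / control rate.

    A negative result represents an increase in risk relative to control.
    """
    control_rate = Fraction(control_events, control_n)
    treatment_rate = Fraction(treatment_events, treatment_n)
    return (control_rate - treatment_rate) / control_rate


# Observed result in the smallest trial: 2/4 v. 1/4.
observed = relative_risk_reduction(2, 4, 1, 4)
# "What if" scenario from the text: true treatment rate of 3 out of 4.
harmful = relative_risk_reduction(2, 4, 3, 4)
# "What if" scenario for the 20-patient groups: 12/20 v. 10/20.
larger_trial = relative_risk_reduction(10, 20, 12, 20)

print(observed, harmful, larger_trial)
```

Exact fractions are used so that the results line up with the hand calculations: a relative risk reduction of 1/2, a 50% increase in risk (shown as –1/2) and a 20% increase in risk (shown as –1/5).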
You can develop similar intuitively derived confidence
intervals for the larger trials. We’ve done this in Table 1B,
which also shows the 95% confidence intervals that we calculated using a statistical program called StatsDirect (available commercially through www.statsdirect.com). You can
see that in some instances we intuitively overestimated or
underestimated the intervals relative to those we derived
using the statistical formulas.

Table 1B: Confidence intervals (CIs) around the relative risk reduction in 5 successively larger hypothetical trials

Control      Treatment    Relative   Relative risk   CI around relative risk reduction, %
event rate   event rate   risk, %    reduction, %    Intuitive CI*   Calculated 95% CI*†
2/4          1/4          50         50              –50 to 90       –174 to 92
10/20        5/20         50         50              –20 to 90       –14 to 79.5
20/40        10/40        50         50              0 to 90         9.5 to 73.4
50/100       25/100       50         50              20 to 80        26.8 to 66.4
500/1000     250/1000     50         50              40 to 60        43.5 to 55.9

*Negative values represent an increase in risk relative to control. See text for further explanation.
†Calculated by statistical software.
The bottom line
Confidence intervals inform clinicians about the range
within which the true treatment effect might plausibly lie,
given the trial data. Greater precision (narrower confidence
intervals) results from larger sample sizes and consequent
larger number of events. Statisticians (and statistical software) can calculate 95% confidence intervals around any
estimate of treatment effect.
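For readers who want to see how software might arrive at such intervals, here is a minimal sketch in Python. It assumes the common large-sample method based on the logarithm of the relative risk; the article does not say which formula StatsDirect applied, so the output approximates but does not reproduce the calculated column of Table 1B, especially for the smallest trials. The function name is our own.

```python
import math


def rrr_confidence_interval(control_events, control_n,
                            treatment_events, treatment_n, z=1.96):
    """Approximate 95% CI for the relative risk reduction (log method).

    Assumption: this is the textbook log relative risk approximation,
    not necessarily the method used to produce Table 1B.
    """
    rr = (treatment_events / treatment_n) / (control_events / control_n)
    # Standard error of log(relative risk).
    se_log_rr = math.sqrt(1 / treatment_events - 1 / treatment_n
                          + 1 / control_events - 1 / control_n)
    rr_low = math.exp(math.log(rr) - z * se_log_rr)
    rr_high = math.exp(math.log(rr) + z * se_log_rr)
    # RRR = 1 - RR, so the limits swap places.
    return 1 - rr_high, 1 - rr_low


# The 5 hypothetical trials of Table 1A (control events, n, treatment events, n).
trials = [(2, 4, 1, 4), (10, 20, 5, 20), (20, 40, 10, 40),
          (50, 100, 25, 100), (500, 1000, 250, 1000)]

for trial in trials:
    low, high = rrr_confidence_interval(*trial)
    print(f"{trial[0]}/{trial[1]} v. {trial[2]}/{trial[3]}: "
          f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Whatever method is used, the pattern is the same: the point estimate stays at 50% while the interval narrows steadily as the number of patients and events grows.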
Tip 2: Interpreting confidence intervals
You should now have an understanding of the relation between the width of the confidence interval
around a measure of outcome in a clinical trial and the number of participants and events in that study.
You are ready to consider whether a study is sufficiently large, and the resulting confidence intervals
sufficiently narrow, to reach a definitive conclusion about recommending the therapy, after taking into
account your patient’s values, preferences and circumstances.
The concept of a minimally important treatment effect proves useful in considering the issue of when a
study is large enough and has therefore generated confidence intervals that are narrow enough to
recommend for or against the therapy. This concept requires the clinician to think about the smallest
amount of benefit that would justify therapy.
Consider a set of hypothetical trials. Fig. 1A displays the results of trial 1. The uppermost point of the
bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent
the boundaries of the 95% confidence interval. For the medical condition being investigated, assume that
a 1% absolute risk reduction is the smallest benefit that patients would consider to outweigh the
downsides of therapy. Given the information in Fig. 1A, would you recommend this treatment to your
patients if the point estimate represented the truth? What if the upper boundary of the confidence
interval represented the truth? Or the lower boundary?
For all 3 of these questions, the answer is yes, provided
that 1% is in fact the smallest patient-important difference.
Thus, the trial is definitive and allows a strong inference
about the treatment decision.
In the case of trial 2 (see Fig. 1B), would your patients
choose to undergo the treatment if either the point estimate
or the upper boundary of the confidence interval represented
the true effect? What about the lower boundary? The answer regarding the lower boundary is no, because the effect
is less than the smallest difference that patients would consider large enough for them to undergo the treatment. Although trial 2 shows a “positive” result (i.e., the confidence
interval does not encompass zero), the sample size was inadequate and the result remains compatible with risk reductions
below the minimal patient-important difference.
[Figure 1. Panel A: trial 1. Panel B: trials 1 and 2. Panel C: trials 3 and 4. Horizontal axis: % absolute risk reduction, from –5 to 5; positive values indicate that treatment helps, negative values that treatment harms.]
Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation,
an absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients would consider important enough to warrant undergoing treatment. In each
case, the uppermost point of the bell curve is the observed treatment effect (the point
estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. See text for further explanation.
When a study result is positive, you can determine
whether the sample size was adequate by checking the lower
boundary of the confidence interval, the smallest plausible
treatment effect compatible with the results. If this value is
greater than the smallest difference your patients would
consider important, the sample size is adequate and the trial
result definitive. However, if the lower boundary falls below
the smallest patient-important difference, leaving patients
uncertain as to whether taking the treatment is in their best
interest, the trial is not definitive. The sample size is inadequate, and further trials are required.
What happens when the confidence interval for the effect of a therapy includes zero (where zero means “no effect” and hence a negative result)?
For studies with negative results — those that do not exclude a true treatment effect of zero — you must focus on
the other end of the confidence interval, that representing
the largest plausible treatment effect consistent with the
trial data. You must consider whether the upper boundary
of the confidence interval falls below the smallest difference
that patients might consider important. If so, the sample
size is adequate, and the trial is definitively negative (see
trial 3 in Fig. 1C). Conversely, if the upper boundary exceeds the smallest patient-important difference, then the
trial is not definitively negative, and more trials with larger
sample sizes are needed (see trial 4 in Fig. 1C).
The bottom line
To determine whether a trial with a positive result is sufficiently large, clinicians should focus on the lower boundary of
the confidence interval and determine if it is greater than the
smallest treatment benefit that patients would consider important enough to warrant taking the treatment. For studies
with a negative result, clinicians should examine the upper
boundary of the confidence interval to determine if this value
is lower than the smallest treatment benefit that patients
would consider important enough to warrant taking the treatment. In either case, if the confidence interval overlaps the
smallest treatment benefit that is important to patients, then
the study is not definitive and a larger study is needed.
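The decision rule summarized above can be written out as a short Python sketch. The function name and the example intervals are hypothetical (they are not read off Fig. 1); all values are absolute risk reductions in percentage points, with positive values favouring treatment.

```python
def interpret_trial(ci_lower, ci_upper, smallest_important_difference):
    """Classify a trial from its 95% CI and the smallest patient-important benefit."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_lower > 0:
        # Positive result: the CI excludes zero, so check the lower boundary.
        if ci_lower > smallest_important_difference:
            return "definitive positive"
        return "positive but not definitive"
    # Negative result: the CI does not exclude zero, so check the upper boundary.
    # (Assumption: a CI lying entirely below zero is handled by the same branch.)
    if ci_upper < smallest_important_difference:
        return "definitive negative"
    return "negative but not definitive"


# Hypothetical intervals, with 1% as the smallest patient-important difference.
examples = {
    "narrow CI above the threshold": (2.0, 8.0),
    "CI above zero but crossing the threshold": (0.5, 9.5),
    "CI including zero, entirely below the threshold": (-1.0, 0.8),
    "CI including zero and crossing the threshold": (-2.0, 4.0),
}
for label, (low, high) in examples.items():
    print(f"{label}: {interpret_trial(low, high, 1.0)}")
```

In both "not definitive" cases the confidence interval overlaps the smallest patient-important benefit, which is the signal that a larger study is needed.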
Table 2: The 3/n rule to estimate the upper limit of the 95% confidence interval (CI) for proportions with 0 in the numerator

n      Observed proportion   3/n      Upper limit of 95% CI
20     0/20                  3/20     0.15 or 15%
100    0/100                 3/100    0.03 or 3%
300    0/300                 3/300    0.01 or 1%
1000   0/1000                3/1000   0.003 or 0.3%
Tip 3: Estimating confidence intervals for
extreme proportions
When reviewing journal articles, readers often encounter
proportions with small numerators or with numerators very
close in size to the denominators. Both situations raise the
same issue. For example, an article might assert that a treatment is safe because no serious complications occurred in the
20 patients who received it; another might claim near-perfect
sensitivity for a test that correctly identified 29 out of 30
cases of a disease. However, in many cases such articles do
not present confidence intervals for these proportions.
The first step of this tip is to learn the “rule of 3” for
zero numerators,7 and the next step is to learn an extension
(which might be called the “rule of 5, 7, 9 and 10”) for numerators of 1, 2, 3 and 4.8
Consider the following example. Twenty people undergo surgery, and none suffer serious complications. Does
this result allow us to be confident that the true complication rate is very low, say less than 5% (1 out of 20)? What
about 10% (2 out of 20)?
You will probably appreciate that if the true complication rate were 5% (1 in 20), it wouldn’t be that unusual to
observe no complications in a sample of 20, but for increasingly higher true rates, the chance of observing no complications in a sample of 20 gets increasingly smaller.
What we are after is the upper limit of a 95% confidence interval for the proportion 0/20. The following is a
simple rule for calculating this upper limit: if an event occurs 0 times in n subjects, the upper boundary of the 95%
confidence interval for the event rate is about 3/n (Table 2).
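To see where the 3/n shortcut comes from, the following Python sketch compares it with an exact calculation. It assumes the usual reasoning behind the rule: find the true event rate p at which observing 0 events in n subjects would happen only 5% of the time, i.e., (1 – p)^n = 0.05. The function names are our own.

```python
def rule_of_three(n):
    """Approximate upper limit of the 95% CI when 0 events occur in n subjects."""
    return 3 / n


def exact_upper_limit(n, alpha=0.05):
    """Largest event rate p for which seeing 0 events in n is still as likely as alpha.

    Solves (1 - p) ** n = alpha for p.
    """
    return 1 - alpha ** (1 / n)


for n in (20, 100, 300, 1000):
    print(f"n = {n}: 3/n = {rule_of_three(n):.4f}, "
          f"exact = {exact_upper_limit(n):.4f}")
```

Because the natural logarithm of 0.05 is close to –3, the exact limit is always slightly below 3/n, and the two values converge as n grows.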
You can use the same formula when the observed proportion is 100%, by translating 100% into its complement.
For example, imagine that the authors of a study on a diagnostic test report 100% sensitivity when the test is performed for 20 patients who have the disease. That means
that the test identified all 20 with the disease as positive and
identified none as falsely negative. You would like to know
how low the sensitivity of the test could be, given that it
was 100% for a sample of 20 patients. Using the 3/n rule
for the proportion of false negatives (0 out of 20), we find
that the proportion of false negatives could be as high as
15% (3 out of 20). Subtract this result from 100% to obtain
the lower limit of the 95% confidence interval for the sensitivity (in this example, 85%).

Table 3: Method for obtaining an approximation of the upper limit of the 95% CI*

Observed numerator   Numerator for calculating approximate upper limit of 95% CI
0                    3
1                    5
2                    7
3                    9
4                    10

*For any observed numerator listed in the left hand column, divide the
corresponding numerator in the right hand column by the number of study
subjects to get the approximate upper limit of the 95% CI. For example, if the
sample size is 15 and the observed numerator is 3, the upper limit of the 95%
confidence interval is approximately 9 ÷ 15 = 0.6 or 60%.
What if the numerator is not zero but is still very small?
There is a shortcut rule for small numerators other than
zero (i.e., 1, 2, 3 or 4) (Table 3).
For example, out of 20 people receiving surgery imagine
that 1 person suffers a serious complication, yielding an observed proportion of 1/20 or 5%. Using the corresponding
value from Table 3 (i.e., 5) and the sample size, we find that
the upper limit of the 95% confidence interval will be
about 5/20 or 25%. If 2 of the 20 (10%) had suffered complications, the upper limit would be about 7/20, or 35%.
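The lookup in Table 3 is easy to express in Python. This is a sketch of the shortcut only; the names are our own, and the second function applies the complement trick described earlier for proportions at or near 100%.

```python
# Table 3: observed numerator -> numerator used for the approximate upper limit.
SHORTCUT_NUMERATORS = {0: 3, 1: 5, 2: 7, 3: 9, 4: 10}


def approximate_upper_limit(observed_numerator, n):
    """Approximate upper limit of the 95% CI for a proportion with a small numerator."""
    if observed_numerator not in SHORTCUT_NUMERATORS:
        raise ValueError("the shortcut covers numerators 0 to 4 only")
    return SHORTCUT_NUMERATORS[observed_numerator] / n


def approximate_lower_limit_near_100(observed_numerator, n):
    """Approximate lower limit when the numerator is within 4 of the denominator."""
    misses = n - observed_numerator  # e.g., the number of false negatives
    return 1 - approximate_upper_limit(misses, n)


print(approximate_upper_limit(1, 20))            # 1 complication in 20 patients
print(approximate_upper_limit(2, 20))            # 2 complications in 20 patients
print(approximate_upper_limit(3, 15))            # example from the Table 3 footnote
print(approximate_lower_limit_near_100(20, 20))  # sensitivity of 20 out of 20
```

The same helper can be applied to the test that identified 29 out of 30 cases: 1 missed case out of 30 gives an approximate upper limit of 5/30 for the miss rate, hence a lower limit of about 83% for sensitivity.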
The bottom line
Although statisticians (and statistical software) can calculate 95% confidence intervals, clinicians can readily estimate
the upper boundary of confidence intervals for proportions
with very small numerators. These estimates highlight the
greater precision attained with larger sample sizes and help
to calibrate intuitively derived confidence intervals.
Conclusions
Clinicians need to understand and interpret confidence
intervals to properly use research results in making decisions. They can use thresholds, based on differences that
patients are likely to consider important, to interpret confidence intervals and to judge whether the results are definitive or whether a larger study (with more patients and
events) is necessary. For proportions with extremely small
numerators, a simple rule is available for estimating the upper limit of the confidence interval.
This article has been peer reviewed.
From the Department of Medicine, Mayo Clinic College of Medicine, Rochester,
Minn. (Montori); the Hospital Medicine Unit, Division of General Medicine,
Emory University, Atlanta, Ga. (Kleinbart); the Departments of Epidemiology and
Biostatistics and of Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Columbia University College of
Physicians and Surgeons, New York, NY (Wyer); the Department of Pediatrics,
University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine
and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton,
Ont. (Guyatt)
Competing interests: None declared.
Contributors: Victor Montori, as principal author, decided on the structure and
flow of the article, and oversaw and contributed to the writing of the manuscript.
Jennifer Kleinbart reviewed the manuscript at all phases of development and contributed to the writing of tip 1. Thomas Newman developed the original idea for
tip 3 and reviewed the manuscript at all phases of development. Sheri Keitz used
all of the tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are described in the article. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence
with format specifications. Virginia Moyer reviewed and revised the final draft of
the manuscript to improve clarity and style. Gordon Guyatt developed the original
ideas for tips 1 and 2, reviewed the manuscript at all phases of development, contributed to the writing as coauthor, and reviewed and revised the final draft of the
manuscript to achieve accuracy and consistency of content as general editor.
References
1. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.
2. Guyatt G, Jaeschke R, Cook D, Walter S. Therapy and understanding the results: hypothesis testing. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 329-38.
3. Guyatt G, Walter S, Cook D, Jaeschke R. Therapy and understanding the results: confidence intervals. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 339-49.
4. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series [editorial]. CMAJ 2004;171(4):347-8.
5. Guyatt G, Montori V, Devereaux PJ, Schunemann H, Bhandari M. Patients at the center: in our practice, and in our use of language. ACP J Club 2004;140:A11-2.
6. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures of association. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 351-68.
7. Hanley J, Lippman-Hand A. If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA 1983;249:1743-5.
8. Newman TB. If almost nothing goes wrong, is almost everything all right? [letter]. JAMA 1995;274:1013.
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,
Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net
Members of the Evidence-Based Medicine Teaching Tips Working
Group: Peter C. Wyer (project director), College of Physicians and
Surgeons, Columbia University, New York, NY; Deborah Cook,
Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke,
McMaster University, Hamilton, Ont.; Rose Hatala (internal
review coordinator), University of British Columbia, Vancouver,
BC; Robert Hayward (editor, online version), Bruce Fisher,
University of Alberta, Edmonton, Alta.; Sheri Keitz (field test
coordinator), Durham Veterans Affairs Medical Center and Duke
University Medical Center, Durham, NC; Alexandra Barratt,
University of Sydney, Sydney, Australia; Pamela Charney, Albert
Einstein College of Medicine, Bronx, NY; Antonio L. Dans,
University of the Philippines College of Medicine, Manila, The
Philippines; Barnet Eskin, Morristown Memorial Hospital,
Morristown, NJ; Jennifer Kleinbart, Emory University School of
Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre,
Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas
McGinn, Mount Sinai Medical Center, New York, NY; Victor M.
Montori, Mayo Clinic College of Medicine, Rochester, Minn.;
Virginia Moyer, University of Texas, Houston, Tex.; Thomas B.
Newman, University of California, San Francisco, San Francisco,
Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.;
Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain;
W. Scott Richardson, Wright State University, Dayton, Ohio; Mark
C. Wilson, University of Iowa, Iowa City, Iowa
Articles to date in this series
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S,
et al. Tips for learners of evidence-based medicine: 1.
Relative risk reduction, absolute risk reduction and
number needed to treat. CMAJ 2004;171(4):353-8.
Correspondance
Online access to a for-profit CMAJ
Wayne Kondro, quoting CMA Secretary-General Bill Tholl, reports
that “Physicians will continue to receive
their free subscription to CMAJ as a benefit of association membership ‘for the
foreseeable future’” after CMA Publications is sold to CMA Holdings in January
2004.1 That’s all to the good, but what
then of CMAJ’s worldwide readers? Will
access to CMAJ remain free for all online
users, despite the shift to for-profit status?
I found it strange that this issue was not
addressed in Kondro’s news article.
Adam L. Scheffler
Independent researcher
Chicago, Ill.
Reference
1. Kondro W. CMAJ enters for-profit market. CMAJ 2004;171(11):1334.
DOI:10.1503/cmaj.1041759
[Editor’s note]
CMAJ’s editors have addressed the
topic of open access in this issue’s
Editorial (see page 149).
DOI:10.1503/cmaj.1041760
Correction
In part 2 of the series “Tips for learners of evidence-based medicine”1 the
information in Fig. 1 did not fully correspond with the information provided in
the text. Specifically, the data for hypothetical trial 2 in Fig. 1B should have
been centred at 5% absolute risk reduction, as described in the text; instead, the
figure showed trial 2 as being centred at
about 6.5% absolute risk reduction. The
corrected figure is presented here.
[Corrected Figure 1. Panel A: trial 1. Panel B: trials 1 and 2. Panel C: trials 3 and 4. Horizontal axis: % absolute risk reduction, from –5 to 5; positive values indicate that treatment helps, negative values that treatment harms.]
Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation, an absolute risk reduction of 1% (double vertical rule) is the smallest benefit
that patients would consider important enough to warrant undergoing treatment. In
each case, the uppermost point of the bell curve is the observed treatment effect
(the point estimate), and the tails of the bell curve represent the boundaries of the
95% confidence interval. See the text1 for further explanation.
Reference
1. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.
DOI:10.1503/cmaj.1041761
JAMC • 18 JANV. 2005; 172 (2)
Review
Synthèse
Tips for learners of evidence-based medicine:
3. Measures of observer variability (kappa statistic)
Thomas McGinn, Peter C. Wyer, Thomas B. Newman, Sheri Keitz, Rosanne Leipzig,
Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group
DOI:10.1503/cmaj.1031981
Imagine that you’re a busy family physician and that
you’ve found a rare free moment to scan the recent literature. Reviewing your preferred digest of abstracts,
you notice a study comparing emergency physicians’ interpretation of chest radiographs with radiologists’ interpretations.1 The article catches your eye because you have frequently found that your own reading of a radiograph differs
from both the official radiologist reading and an unofficial
reading by a different radiologist, and you’ve wondered
about the extent of this disagreement and its implications.
Looking at the abstract, you find that the authors have reported the extent of agreement using the κ statistic. You recall
that κ stands for “kappa” and that you have encountered this
measure of agreement before, but your grasp of its meaning
remains tentative. You therefore choose to take a quick glance
at the authors’ conclusions as reported in the abstract and to
defer downloading and reviewing the full text of the article.
Practitioners, such as the family physician just described,
may benefit from understanding measures of observer variability. For many studies in the medical literature, clinician
readers will be interested in the extent of agreement among
multiple observers. For example, do the investigators in a
clinical study agree on the presence or absence of physical,
radiographic or laboratory findings? Do investigators involved in a systematic overview agree on the validity of an
article, or on whether the article should be included in the
analysis? In perusing these types of studies, where investigators are interested in quantifying agreement, clinicians
will often come across the kappa statistic.
In this article we present tips aimed at helping clinical
learners to use the concepts of kappa when applying diagnostic tests in practice. The tips presented here have been
adapted from approaches developed by educators experienced in teaching evidence-based medicine skills to clinicians.2 A related article, intended for people who teach
these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/11/1369/DC1.
Clinician learners’ objectives
Defining the importance of kappa
• Understand the difference between measuring agreement and measuring agreement beyond chance.
• Understand the implications of different values of kappa.
Calculating kappa
• Understand the basics of how the kappa score is
calculated.
• Understand the importance of “chance agreement” in
estimating kappa.
Calculating chance agreement
• Understand how to calculate the kappa score given different distributions of positive and negative results.
• Understand that the more extreme the distributions of
positive and negative results, the greater the agreement
that will occur by chance alone.
• Understand how to calculate chance agreement, agreement beyond chance and kappa for any set of assessments by 2 observers.
Tip 1: Defining the importance of kappa
A common stumbling block for clinicians is the basic
concept of agreement beyond chance and, in turn, the importance of correcting for chance agreement. People making a decision on the basis of presence or absence of an element of the physical examination, such as Murphy’s sign,
will sometimes agree simply by chance. The kappa statistic
corrects for this chance agreement and tells us how much
of the possible agreement over and above chance the reviewers have achieved.
A simple example should help to clarify the importance
of correcting for chance agreement. Two radiologists independently read the same 100 mammograms. Reader 1 is
having a bad day and reads all the films as negative without looking at them in great detail. Reader 2 reads the
films more carefully and identifies 4 of the 100 mammograms as positive (suspicious for malignancy). How would
you characterize the level of agreement between these 2
radiologists?
Teachers of evidence-based medicine:
See the “Tips for teachers” version of this article online
at www.cmaj.ca/cgi/content/full/171/11/1369/DC1. It
contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the
challenges they encounter when teaching these concepts
to clinician learners and links to useful online resources.
CMAJ • NOV. 23, 2004; 171 (11)
© 2004 Canadian Medical Association or its licensors
The percent agreement between them is 96%, even
though one of the readers has, on cursory review, decided
to call all of the results negative. Hence, measuring the
simple percent agreement overestimates the degree of clinically important agreement in a fashion that is misleading.
The role of kappa is to indicate how much the 2 observers
agree beyond the level of agreement that could be expected
by chance. Table 1 presents a rating system that is commonly used as a guideline for evaluating kappa scores.
Purely to illustrate the range of kappa scores that readers
can expect to encounter, Table 2 gives some examples of
commonly reported assessments and the kappa scores that
resulted when investigators studied their reproducibility.
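To make the mammogram example concrete, here is a minimal Python sketch that computes both the simple percent agreement and kappa from the two readers' calls. The function name `agreement_stats` and the list encoding (True = positive reading) are our own illustrative choices, and chance agreement is estimated from each reader's overall proportion of positive calls, which is the usual convention but is not spelled out for this example in the text. Under those assumptions, the 96% agreement corresponds to a kappa of about 0, because all of the agreement is what chance alone would produce.

```python
def agreement_stats(ratings_a, ratings_b):
    """Return (percent agreement, chance agreement, kappa) for two raters' yes/no calls."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty lists of ratings")
    n = len(ratings_a)
    # Proportion of cases on which the two raters made the same call.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Each rater's overall proportion of positive calls.
    pos_a = sum(ratings_a) / n
    pos_b = sum(ratings_b) / n
    # Agreement expected if the calls were unrelated (assumption: independence).
    chance = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    # Kappa is undefined when chance agreement is already 100%.
    kappa = (observed - chance) / (1 - chance) if chance < 1 else float("nan")
    return observed, chance, kappa


# Reader 1 calls all 100 films negative; reader 2 calls 4 of them positive.
reader_1 = [False] * 100
reader_2 = [True] * 4 + [False] * 96
observed, chance, kappa = agreement_stats(reader_1, reader_2)
print(f"percent agreement = {observed:.0%}, chance agreement = {chance:.0%}")
print(f"kappa is approximately {abs(kappa):.2f}")
```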
The bottom line
If clinicians neglect the possibility of chance agreement,
they will come to misleading conclusions about the reproducibility of clinical tests. The kappa statistic allows us to
measure agreement above and beyond that expected by
chance alone. Examples of kappa scores for frequently ordered tests sometimes show surprisingly poor levels of
agreement beyond chance.
Table 1: Qualitative classification of kappa values as degree of agreement beyond chance3
Kappa value    Degree of agreement beyond chance
0              None
0–0.2          Slight
0.2–0.4        Fair
0.4–0.6        Moderate
0.6–0.8        Substantial
0.8–1.0        Almost perfect
Table 2: Representative kappa values for common tests and clinical assessments
Assessment                                                          Kappa value
Interpretation of T wave changes on an exercise stress test4        0.25
Presence of jugular venous distension5                              0.56
Detection of alcohol dependence using CAGE questionnaire6           0.75
Presence of goitre7                                                 0.82–0.95
Bone marrow interpretation by hematologist8                         0.84
Straight leg raising test9                                          0.82
Diagnosis of pulmonary embolus by helical CT10                      0.82
Diagnosis of lower extremity arterial disease by arteriography11    0.39–0.64
Tip 2: Calculating kappa
What is the maximum potential for agreement between 2 observers doing a clinical assessment, such as presence or absence of Murphy’s sign in patients with abdominal pain? In Fig. 1, the upper horizontal bar represents 100% agreement between 2 observers. For the hypothetical situation represented in the figure, the estimated chance agreement between the 2 observers is 50%. This would occur if, for example, each of the 2 observers randomly called half of the assessments positive. Given this information, what is the possible agreement beyond chance?
The vertical line in Fig. 1 intersects the horizontal bars at the 50% point that we identified as the expected agreement by chance. All agreement to the right of this line corresponds to agreement beyond chance. Hence the maximum agreement beyond chance is 50% (100% – 50%).
The other number you need to calculate the kappa score is the degree of agreement beyond chance. The observed agreement, as shown by the lower horizontal bar in Fig. 1, is 75%, so the degree of agreement beyond chance is 25% (75% – 50%).
Kappa is calculated as the observed agreement beyond chance (25%) divided by the maximum agreement beyond chance (50%); here, kappa is 0.50.
[Fig. 1 diagram: two horizontal bars. Agreement expected by chance: 50%. Possible agreement above chance: 50%. Observed agreement: 75%. Observed agreement above chance: 25%. kappa = 25/50 = 0.5 (moderate agreement)]
Fig. 1: Two observers independently assess the presence or
absence of a finding or outcome. Each observer determines
that the finding is present in exactly 50% of the subjects. Their
assessments agree in 75% of the cases. The yellow horizontal
bar represents potential agreement (100%), and the turquoise
bar represents actual agreement. The portion of each coloured
bar that lies to the left of the dotted vertical line represents the
agreement expected by chance (50%). The observed agreement above chance is half of the possible agreement above
chance. The ratio of these 2 numbers is the kappa score.
Tips for EBM learners: kappa statistic
The bottom line
Kappa allows us to measure agreement above and beyond that expected by chance alone. We calculate kappa by
estimating the chance agreement and then comparing the
observed agreement beyond chance with the maximum
possible agreement beyond chance.
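The calculation in Fig. 1 can be written as a one-line function. This is a minimal sketch in Python; the name `kappa_from_agreement` is ours, and both inputs are assumed to be proportions between 0 and 1 rather than percentages.

```python
def kappa_from_agreement(observed, chance):
    """Kappa from observed agreement and chance-expected agreement (proportions, 0 to 1)."""
    if not 0 <= chance < 1:
        raise ValueError("chance agreement must be at least 0 and less than 1")
    observed_beyond_chance = observed - chance   # 0.75 - 0.50 = 0.25 in Fig. 1
    possible_beyond_chance = 1 - chance          # 1.00 - 0.50 = 0.50 in Fig. 1
    return observed_beyond_chance / possible_beyond_chance


# Fig. 1: observed agreement 75%, agreement expected by chance 50%.
fig1_kappa = kappa_from_agreement(observed=0.75, chance=0.50)
print(fig1_kappa)
```

The guard on `chance` reflects that kappa is undefined when chance agreement is already 100%, since no agreement beyond chance is then possible.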
Tip 3: Calculating chance agreement
A conceptual understanding of kappa may still leave the
actual calculations a mystery. The following example is intended for those who desire a more complete understanding of the kappa statistic.
Let us assume that 2 hopeless clinicians are assessing the
presence of Murphy’s sign in a group of patients. They
have no idea what they are doing, and their evaluations are
no better than blind guesses. Let us say they are each
guessing the presence and absence of Murphy’s sign in a
50:50 ratio: half the time they guess that Murphy’s sign is
present, and the other half that it is absent. If you were
completing a 2 × 2 table, with these 2 clinicians evaluating
the same 100 patients, how would the cells, on average, get
filled in?
Fig. 2 represents the completed 2 × 2 table. Guessing at
random, the 2 hopeless clinicians have agreed on the assessments of 50% of the patients. How did we arrive at the
numbers shown in the table? According to the laws of
chance, each clinician guesses that half of the 50 patients
assessed as positive by the other clinician (i.e., 25 patients)
have Murphy’s sign.
How would this exercise work if the same 2 hopeless
clinicians were to randomly guess that 60% of the patients
had a positive result for Murphy’s sign? Fig. 3 provides the
answer in this situation. The clinicians would agree for 52
of the 100 patients (or 52% of the time) and would disagree
for 48 of the patients. In a similar way, using 2 × 2 tables for higher and higher positive proportions (i.e., how often the observer makes the diagnosis), you can figure out how often the observers will, on average, agree by chance alone (as delineated in Table 3).
At this point, we have demonstrated 2 things. First, even
if the reviewers have no idea what they are doing, there will
be substantial agreement by chance alone. Second, the
magnitude of the agreement by chance increases as the
proportion of positive (or negative) assessments increases.
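The 2 × 2 tables for random guessing reduce to a short formula: if the observers call a proportion p1 and p2 of patients positive, independently of each other, they agree on a positive call with probability p1 × p2 and on a negative call with probability (1 – p1) × (1 – p2). The sketch below assumes exactly that independence; the function name `chance_agreement` is ours. It reproduces the figures of 50% and 52% derived above and the remaining values in Table 3.

```python
def chance_agreement(p1_positive, p2_positive):
    """Expected agreement when two observers make positive calls independently.

    p1_positive, p2_positive: each observer's proportion of positive calls (0 to 1).
    """
    both_positive = p1_positive * p2_positive
    both_negative = (1 - p1_positive) * (1 - p2_positive)
    return both_positive + both_negative


# Both observers guess "positive" at the same rate, as in Fig. 2 and Fig. 3.
for pct in (50, 60, 70, 80, 90):
    p = pct / 100
    print(f"proportion positive {pct}%: agreement by chance "
          f"{round(100 * chance_agreement(p, p))}%")
```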
But how can we calculate kappa when the clinicians
whose assessments are being compared are no longer
“hopeless,” in other words, when their assessments reflect a
level of expertise that one might actually encounter in practice? It’s not very hard.
Let’s take a simple example, returning to the premise
that each of the 2 clinicians assesses Murphy’s sign as being present in 50% of the patients. Here, we assume that
the 2 clinicians now have some knowledge of Murphy’s
sign and their assessments are no longer random. Each
decides that 50% of the patients have Murphy’s sign and
50% do not, but they still don’t agree on every patient.
Rather, for 40 patients they agree that Murphy’s sign is
present, and for 40 patients they agree that Murphy’s sign
is absent. Thus, they agree on the diagnosis for 80% of
the patients, and they disagree for 20% of the patients
(see Fig. 4A). How do we calculate the kappa score in this
situation?
                              Clinician 2
Clinician 1        Sign present   Sign absent   Total
Sign present            25             25         50
Sign absent             25             25         50
Total                   50             50
Fig. 2: Agreement table for 2 hopeless clinicians who randomly guess whether Murphy’s sign is present or absent in 100 patients with abdominal pain. Each clinician determines that half of the patients have a positive result. The numbers in each box reflect the number of patients in each agreement category.
                              Clinician 2
Clinician 1        Sign present   Sign absent   Total
Sign present            36             24         60
Sign absent             24             16         40
Total                   60             40
Fig. 3: As in Fig. 2, the 2 clinicians again guess at random whether Murphy’s sign is present or absent. However, each clinician now guesses that the sign is present in 60 of the 100 patients. Under these circumstances, of the 60 patients for whom clinician 1 guesses that the sign is present, clinician 2 guesses that it is present in 60%; 60% of 60 is 36 patients. Of the 60 patients for whom clinician 1 guesses that the sign is present, clinician 2 guesses that it is absent in 40%; 40% of 60 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is present in 60%; 60% of 40 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is absent in 40%; 40% of 40 is 16 patients.
Recall that if each clinician found that 50% of the patients had Murphy’s sign but their decision about the presence of the sign in each patient was random, the clinicians would be in agreement 50% of the time, each cell of the 2 × 2 table would have 25 patients (as shown in Fig. 2), chance agreement would be 50%, and maximum agreement beyond chance would also be 50%.
The no-longer-hopeless clinicians’ agreement on 80% of the patients is therefore 30% above chance. Kappa is a comparison of the observed agreement above chance with the maximum agreement above chance: 30%/50% = 60% of the possible agreement above chance, which gives these clinicians a kappa of 0.6, as shown in Fig. 4B.
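The same steps can be carried out directly on a 2 × 2 table of counts. The following Python sketch is illustrative only: the function name `kappa_from_table` and the nested-list layout (rows for clinician 1, columns for clinician 2, "sign present" first) are our assumptions. It first derives the counts expected by chance from the row and column totals, which are the numbers shown in parentheses in Fig. 4B, and then computes kappa.

```python
def kappa_from_table(table):
    """Expected counts, observed agreement, chance agreement and kappa for a 2 x 2 table.

    table[i][j] is the number of patients rated i by clinician 1 and j by clinician 2,
    where index 0 means "sign present" and index 1 means "sign absent".
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Count expected in each cell if the two clinicians' calls were unrelated.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    observed_agreement = (table[0][0] + table[1][1]) / n
    chance_agreement = (expected[0][0] + expected[1][1]) / n
    kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
    return expected, observed_agreement, chance_agreement, kappa


# Counts from Fig. 4A: agreement on 40 "present" and 40 "absent", 20 disagreements.
fig_4a = [[40, 10],
          [10, 40]]
expected, observed, chance, kappa = kappa_from_table(fig_4a)
print("expected counts:", expected)
print(f"observed {observed:.0%}, chance {chance:.0%}, kappa {kappa:.1f}")
```

Working from counts rather than percentages means the chance agreement no longer has to be assumed; it follows from how often each clinician calls the sign present.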
Formula for calculating kappa
(Observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)
Another way of expressing this formula:
(Observed agreement beyond chance) ÷ (maximum possible agreement beyond chance)
Hence, to calculate kappa when only 2 alternatives are possible (e.g., presence or absence of a finding), you need just 2 numbers: the percentage of patients that the 2 assessors agreed on and the expected agreement by chance. Both can be determined by constructing a 2 × 2 table exactly as illustrated above.
The bottom line
Chance agreement is not always 50%; rather, it varies from one clinical situation to another. When the prevalence of a disease or outcome is low, 2 observers will guess that most patients are normal and the symptom of the disease is absent. This situation will lead to a high percentage of agreement simply by chance. When the prevalence is high, there will also be high apparent agreement, with most patients judged to exhibit the symptom. Kappa measures the agreement after correcting for this variable degree of chance agreement.
Table 3: Chance agreement when 2 observers randomly assign positive and negative results, for successively higher rates of a positive call
Proportion positive (%)    Agreement by chance (%)
50                         50
60                         52
70                         58
80                         68
90                         82
A
                              Clinician 2
Clinician 1        Sign present   Sign absent
Sign present            40             10
Sign absent             10             40
B
                              Clinician 2
Clinician 1        Sign present   Sign absent   Total
Sign present          40 (25)        10 (25)      50
Sign absent           10 (25)        40 (25)      50
Total                   50             50
κ = (observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)
  = (80% – 50%) ÷ (100% – 50%)
  = 30% ÷ 50%
  = 0.6
Fig. 4: Two clinicians who have been trained to assess Murphy’s sign in patients with abdominal pain do an actual assessment on 100 patients. A: A 2 × 2 table reflecting actual agreement between the 2 clinicians. B: A 2 × 2 table illustrating the correct approach to determining the kappa score. The numbers in parentheses correspond to the results that would be expected were each clinician randomly guessing that half of the patients had a positive result (as in Fig. 2).
Conclusions
Armed with this understanding of kappa as a measure of
agreement between different observers, you are able to return to the study of agreement in chest radiography interpretations between emergency physicians and radiologists1
in a more informed fashion. You learn from the abstract
that the kappa score for overall agreement between the 2
classes of practitioners was 0.40, with a 95% confidence
interval ranging from 0.35 to 0.46. This means that the
agreement between emergency physicians and radiologists
represented 40% of the potentially achievable agreement
beyond chance. You understand that this kappa score
would be conventionally considered to represent fair to
moderate agreement but is inferior to many of the kappa
values listed in Table 2. You are now much more confident
about going to the full text of the article to review the
methods and assess the clinical applicability of the results to
your own patients.
The ability to understand measures of variability in data
presented in clinical trials and systematic reviews is an important skill for clinicians. We have presented a series of
tips developed and used by experienced teachers of evidence-based medicine for the purpose of facilitating such
understanding.
This article has been peer reviewed.
From the Department of Medicine, Division of General Internal Medicine
(McGinn), and the Department of Geriatrics (Leipzig), Mount Sinai Medical Center, New York, NY; the Columbia University College of Physicians and Surgeons,
New York, NY (Wyer); the Departments of Epidemiology and Biostatistics and of
Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University Medical
Center, Durham, NC (Keitz); and the Departments of Medicine and of Clinical
Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)
Competing interests: None declared.
Contributors: Thomas McGinn developed the original idea for tips 1 and 2 and, as
principal author, oversaw and contributed to the writing of the manuscript.
Thomas Newman and Roseanne Leipzig reviewed the manuscript at all phases of
development and contributed to the writing as coauthors. Sheri Keitz used all of
the tips as part of a live teaching exercise and submitted comments, suggestions
and the possible variations that are described in the article. Peter Wyer reviewed
and revised the final draft of the manuscript to achieve uniform adherence with
format specifications. Gordon Guyatt developed the original idea for tip 3, reviewed the manuscript at all phases of development, contributed to the writing as a
coauthor, and, as general editor, reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content.
References
1. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs
in the emergency department: Is the radiologist really necessary? Postgrad
Med J 2003;79:214-7.
2. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips
for learning and teaching evidence-based medicine: introduction to the series
[editorial]. CMAJ 2004;171(4):347-8.
3. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic.
Am J Epidemiol 1987;126:161-9.
4. Blackburn H. The exercise electrocardiogram: differences in interpretation.
Report of a technical group on exercise electrocardiography. Am J Cardiol
1968;21:871-80.
5. Cook DJ. Clinical assessment of central venous pressure in the critically ill.
Am J Med Sci 1990;299:175-8.
6. Aertgeerts B, Buntinx F, Fevery J, Ansoms S. Is there a difference between
CAGE interviews and written CAGE questionnaires? Alcohol Clin Exp Res
2000;24:733-6.
7. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement in two general practices in Great Britain. BMJ 1963;1:29-34.
8. Guyatt GH, Patterson C, Ali M, Singer J, Levine M, Turpie I, et al. Diagnosis of iron-deficiency anemia in the elderly. Am J Med 1990;88:205-9.
9. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award
in clinical sciences. Reproducibility of physical signs in low-back pain. Spine
1989;14:908-18.
10. Perrier A, Howarth N, Didier D, Loubeyre P, Unger PF, de Moerloose P, et
al. Performance of helical computed tomography in unselected outpatients
with suspected pulmonary embolism. Ann Intern Med 2001;135:88-97.
11. Koelemay MJ, Legemate DA, Reekers JA, Koedam NA, Balm R, Jacobs MJ.
Interobserver variation in interpretation of arteriography and management of
severe lower leg arterial disease. Eur J Vasc Endovasc Surg 2001;21:417-22.
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,
Pelham NY 10803, USA; fax 914 738-9368; pwyer@att.net
Members of the Evidence-Based Medicine Teaching Tips
Working Group: Peter C. Wyer (project director), College of
Physicians and Surgeons, Columbia University, New York, NY;
Deborah Cook, Gordon Guyatt (general editor), Ted Haines,
Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose
Hatala (internal review coordinator), University of British
Columbia, Vancouver, BC; Robert Hayward (editor, online
version), Bruce Fisher, University of Alberta, Edmonton, Alta.;
Sheri Keitz (field test coordinator), Durham Veterans Affairs
Medical Center and Duke University Medical Center, Durham,
NC; Alexandra Barratt, University of Sydney, Sydney, Australia;
Pamela Charney, Albert Einstein College of Medicine, Bronx, NY;
Antonio L. Dans, University of the Philippines College of
Medicine, Manila, The Philippines; Barnet Eskin, Morristown
Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory
University School of Medicine, Atlanta, Ga.; Hui Lee, formerly
Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne
Leipzig, Thomas McGinn, Mount Sinai Medical Center, New
York, NY; Victor M. Montori, Mayo Clinic College of Medicine,
Rochester, Minn.; Virginia Moyer, University of Texas, Houston,
Tex.; Thomas B. Newman, University of California, San
Francisco, San Francisco, Calif.; Jim Nishikawa, University of
Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf
University, Manama, Bahrain; W. Scott Richardson, Wright State
University, Dayton, Ohio; Mark C. Wilson, University of Iowa,
Iowa City, Iowa
Articles to date in this series
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz
S, et al. Tips for learners of evidence-based medicine:
1. Relative risk reduction, absolute risk reduction and
number needed to treat. CMAJ 2004;171(4):353-8.
Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC,
Moyer V, et al. Tips for learners of evidence-based
medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.
Review
Synthèse
Tips for learners of evidence-based medicine:
4. Assessing heterogeneity of primary studies
in systematic reviews and whether to combine
their results
Rose Hatala, Sheri Keitz, Peter Wyer, Gordon Guyatt, for the Evidence-Based Medicine
Teaching Tips Working Group
DOI:10.1503/cmaj.1031920
Clinicians wishing to quickly answer a clinical question
may seek a systematic review, rather than searching
for primary articles. Such a review is also called a
meta-analysis when the investigators have used statistical
techniques to combine results across studies. Databases useful for this purpose include the Cochrane Library (www.thecochranelibrary.com) and the ACP Journal Club (www.acpjc.org; use the search term “review”), both of which are
available through personal or institutional subscription.
Clinicians can use systematic reviews to guide clinical practice if they are able to understand and interpret the results.
Systematic reviews differ from traditional reviews in that
they are usually confined to a single focused question,
which serves as the basis for systematic searching, selection
and critical evaluation of the relevant research.1 Authors of
systematic reviews use explicit methods to minimize bias
and consider using statistical techniques to combine the results of individual studies. When appropriate, such pooling
allows a more precise estimate of the magnitude of benefit
or harm of a therapy. It may also increase the applicability
of the result to a broader range of patient populations.
Clinicians encountering a meta-analysis frequently find
the pooling process mysterious. Specifically, they wonder
how authors decide whether the ranges of patients, interventions and outcomes are too broad to sensibly pool the
results of the primary studies.
In this article we present an approach to evaluating potentially important differences in the results of individual
studies being considered for a meta-analysis. These differences are frequently referred to as heterogeneity.1 Our discussion focuses on the qualitative, rather than the statistical, assessment of heterogeneity (see Box 1).
Two concepts are commonly implied in the assessment
of heterogeneity. The first is an assessment for heterogeneity within 4 key elements of the design of the original studies: the patients, interventions, outcomes and methods. This
assessment bears on the question of whether pooling the results is at all sensible. The second concept relates to assessing heterogeneity among the results of the original studies.
Even if the study designs are similar, the researchers must
decide whether it is useful to combine the primary studies’
results. Our discussion assumes a basic familiarity with how
investigators present the magnitude2,3 and precision4 of
treatment effects in individual randomized trials.
The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians.1,5,6 A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/172/5/661/DC1.
Clinician learners’ objectives
Qualitative assessment of the design of primary
studies
• Understand the concepts of heterogeneity of study design among the individual studies included in a systematic review.
Qualitative assessment of the results of primary
studies
• Understand how to qualitatively determine the appropriateness of pooling estimates of effect from the individual studies by assessing (1) the degree of overlap of
the confidence intervals around these point estimates of
effect and (2) the disparity between the point estimates
themselves.
• Understand how to estimate the “true” value of the estimate of effect from a graphic display of the results of
individual studies.
Teachers of evidence-based medicine:
See the “Tips for teachers” version of this article online
at www.cmaj.ca/cgi/content/full/172/5/661/DC1. It
contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the
challenges they encounter when teaching these concepts
to clinician learners and links to useful online resources.
CMAJ • MAR. 1, 2005; 172 (5)
© 2005 CMA Media Inc. or its licensors
Box 1: Statistical assessments of heterogeneity
Meta-analysts typically use 2 statistical approaches to evaluate
the extent of variability in results between studies: Cochran’s
Q test and the I² statistic.
Cochran’s Q test
• Cochran’s Q test is the traditional test for heterogeneity. It
begins with the null hypothesis that all of the apparent
variability is due to chance. That is, the true underlying
magnitude of effect (whether measured with a relative risk,
an odds ratio or a risk difference) is the same across studies.
• The test then generates a probability, based on a χ²
distribution, that differences in results between studies as
extreme as or more extreme than those observed could occur
simply by chance.
• If the p value is low (say, less than 0.1) investigators should
look hard for possible explanations of variability in results
between studies (including differences in patients,
interventions, measurement of outcomes and study design).
• As the p value gets very low (less than 0.01) we may be
increasingly uncomfortable about using single best estimates
of treatment effects.
• The traditional test for heterogeneity is limited, in that it may
be underpowered (when studies have included few patients it
may be difficult to reject the null hypothesis even if it is false)
or overpowered (when sample sizes are very large, small and
unimportant differences in magnitude of effect may
nevertheless generate low p values).
I² statistic
• The I² statistic, the second approach to measuring
heterogeneity, attempts to deal with potential underpowering
or overpowering. I² provides an estimate of the percentage of
variability in results across studies that is likely due to true
differences in treatment effect, as opposed to chance.
• When I² is 0%, chance provides a satisfactory explanation for
the variability we have observed, and we are more likely to
be comfortable with a single pooled estimate of treatment
effect.
• As I² increases, we get increasingly uncomfortable with a
single pooled estimate, and the need to look for explanations
of variability other than chance becomes more compelling.
• For example, one rule of thumb characterizes I² of less than
25% as low heterogeneity, 25% to 50% as moderate
heterogeneity and over 50% as high heterogeneity.
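For readers who want to see how the 2 statistics in Box 1 relate to each other, the following Python sketch computes Cochran's Q and I² from a set of study results. It rests on assumptions that Box 1 does not spell out: each study contributes an effect estimate on a scale where pooling makes sense (e.g., a log relative risk) with its standard error, studies are weighted by the inverse of their variance, and I² is taken as (Q – degrees of freedom) ÷ Q, expressed as a percentage and set to 0 when negative. The function name and the example numbers are hypothetical and invented purely for illustration.

```python
def cochran_q_and_i2(effects, std_errors):
    """Return (Q, degrees of freedom, I-squared in %) for a set of study results."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least 2 studies with matching effects and standard errors")
    # Inverse-variance weights: more precise studies count for more.
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations of each study from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of the variability beyond what chance alone would produce.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


# Hypothetical log relative risks and standard errors (not taken from any real study).
log_rr = [-0.30, -0.10, -0.50, 0.20]
std_err = [0.10, 0.10, 0.10, 0.10]
q, df, i2 = cochran_q_and_i2(log_rr, std_err)
print(f"Q = {q:.2f} on {df} degrees of freedom, I-squared = {i2:.0f}%")
```

The p value for Q would come from comparing it with a χ² distribution on the stated degrees of freedom; that step is left out here to keep the sketch free of external libraries.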
Tip 1: Qualitative assessment of the design of
primary studies
Consider the following 3 hypothetical systematic reviews. For which of these systematic reviews does it make
sense to combine the primary studies?
• A systematic review of all therapies for all types of cancer, intended to generate a single estimate of the impact
of these therapies on mortality.
• A systematic review that examines the effect of different
antibiotics, such as tetracyclines, penicillins and chloramphenicol, on improvement in peak expiratory flow
rates and days of illness in patients with acute exacerbation of obstructive lung disease, including chronic
bronchitis and emphysema.7
• A systematic review of the effectiveness of tissue plasminogen activator (tPA) compared with no treatment
or placebo in reducing mortality among patients with
acute myocardial infarction.8
Most clinicians would instinctively reject the first of
these proposed reviews as overly broad but would be comfortable with the idea of combining the results of trials relevant to the third question. What about the second review?
What aspects of the primary studies must be similar to justify combining their results in this systematic review?
Table 1 lists features that would be relevant to the
question considered in the second review and categorizes
them according to the 4 key elements of study design: the
patients, interventions, outcomes and methods of the primary studies. Combining results is appropriate when the
biology is such that across the range of patients, interventions, outcomes and study methods, one can anticipate
more or less the same magnitude of treatment effect.
In other words, the judgement as to whether the primary studies are similar enough to be combined in a systematic review is based on whether the underlying pathophysiology would predict a similar treatment effect across
the range of patients, interventions, outcomes and study
methods of the primary studies. If you think back to the
first systematic review — all therapies for all cancers — you
probably recognize that there is significant variability in the pathophysiology of different cancers (“patients” in Table 1) and in the mechanisms of action of different cancer therapies (“interventions” in Table 1).
Table 1: Relevant features of study design to be considered when deciding whether to pool studies in a systematic review (for a review examining the effect of antibiotics in patients with obstructive lung disease)
Patients: Patient age; Patient sex; Type of lung disease (e.g., emphysema, chronic bronchitis)
Interventions: Same antibiotic in all studies; Same class of antibiotic in all studies; Comparison of antibiotic with placebo; Comparison of one antibiotic with another
Outcomes: Death; Peak expiratory flow; Forced expiratory volume in the first second
Study methods: All randomized trials; Only blinded randomized trials; Cohort studies
If you were inclined to reject pooling the results of the
studies to be considered in the second systematic review, you
might have reasoned that we would expect substantially different effects with different antibiotics, different infecting
agents or different underlying lung pathology. If you were
inclined to accept pooling of results in this review, you might
argue that the antibiotics used in the different studies are all
effective against the most common organisms underlying
pulmonary exacerbations. You might also assert that the biology of an acute exacerbation of an obstructive lung disease
(e.g., inflammation) is similar, despite variability in the underlying pathology. In other words, we would expect more
or less the same effect across agents and across patients.
Finally, you probably accepted the validity of pooling results for the third systematic review — tPA for myocardial
infarction — because you consider that the mechanism of
myocardial infarction is relatively constant across a broad
range of patients.
The bottom line
• Similarity in the aspects of primary study design outlined in Table 1 (patients, interventions, outcomes, study methods) guides the decision as to whether it makes sense to combine the results of primary studies in a systematic review.
• The range of characteristics of the primary studies across which it is sensible to combine results is a matter of judgment based on the researcher’s understanding of the underlying biology of the disease.
Tip 2: Qualitative assessment of the results of primary studies
You should now understand that combining the results of different studies is sensible only when we expect more or less the same magnitude of treatment effects across the range of patients, interventions and outcomes that the investigators have included in their systematic review. However, even when we are confident of the similarity in design among the individual studies, we may still wonder whether the results of the studies should be pooled. The following graphic demonstration shows how to qualitatively assess the results of the primary studies to decide if meta-analysis (i.e., statistical pooling) is appropriate. You can find discussions of quantitative, or statistical, approaches to the assessment of heterogeneity elsewhere (see Box 1 or Higgins and associates9).
Consider the results of the studies in 2 hypothetical systematic reviews (Fig. 1A and Fig. 1B). The central vertical line, labelled “no difference,” represents a treatment effect of 0. This would be equivalent to a risk ratio or relative risk of 1 or an absolute or relative risk reduction of 0 (reference 2). Values to the left of the “no difference” line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies represented in the figures, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect. For which systematic review does it make sense to combine results? Decide on the answer to this question before you read on.
[Fig. 1, panels A and B: plots of 4 studies each; horizontal axis runs from “Favours new treatment” through “No difference” to “Favours control”]
Fig. 1: Results of the studies in 2 hypothetical systematic reviews. The central vertical line represents a treatment effect of 0. Values to the left of this line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies in each figure, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect.
[Fig. 2: plot of 4 studies; horizontal axis runs from “Favours new treatment” through “No difference” to “Favours control”]
Fig. 2: Point estimates and confidence intervals for 4 studies. Two of the point estimates favour the new treatment, and the other 2 point estimates favour the control. Investigators doing a systematic review with these 4 studies would be satisfied that it is appropriate to pool the results.
[Fig. 3: plot of the 4 studies from Fig. 1B with a diamond labelled “Pooled estimate of underlying effect”; horizontal axis runs from “Favours new treatment” through “No difference” to “Favours control”]
Fig. 3: Results of the hypothetical systematic review presented in Fig. 1B. The pooled estimate at the bottom of the chart (large diamond) provides the best guess as to the underlying treatment effect. It is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials.
You have probably concluded that pooling is appropriate for the studies represented in Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference” line, whereas those for the studies in Fig. 1B lie on the same side of the “no difference” line?
Before you answer this question, consider the studies
represented in Fig. 2. Here, the point estimates of 2 studies
are on the “favours new treatment” side of the “no difference” line, and the point estimates of 2 other studies are on
the “favours control” side. However, all 4 point estimates
are very close to the “no difference” line, and, in this case,
investigators doing a systematic review will be satisfied that
it is appropriate to pool the results. Therefore, it is not the
position of the point estimates relative to the “no difference” line that determines the appropriateness of pooling.
There are 2 criteria for not combining the results of
studies in a meta-analysis: highly disparate point estimates
and confidence intervals with little overlap, both of which
are exemplified by Fig. 1A. When pooling is appropriate on
the basis of these criteria, where is the best estimate of the
underlying magnitude of effect likely to be? Look again at
Fig. 1B and make a guess. Now look at Fig. 3.
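The 2 criteria above are meant to be judged by eye from the plot, but a rough numerical analogue can make them concrete. The sketch below is an assumption-laden illustration, not the authors' method: it measures the spread of the point estimates and checks whether all confidence intervals share a common region (a stricter test than "little overlap"). The study values are invented and are not read off Fig. 1A or Fig. 1B.

```python
from typing import List, Optional, Tuple

# Each study is (point_estimate, ci_lower, ci_upper) on a scale where
# 0 means "no difference" and negative values favour the new treatment.
Study = Tuple[float, float, float]


def estimate_spread(studies: List[Study]) -> float:
    """Distance between the most extreme point estimates."""
    points = [s[0] for s in studies]
    return max(points) - min(points)


def common_overlap(studies: List[Study]) -> Optional[Tuple[float, float]]:
    """Region shared by every confidence interval, or None if there is none."""
    lower = max(s[1] for s in studies)
    upper = min(s[2] for s in studies)
    return (lower, upper) if lower <= upper else None


# Hypothetical values: widely differing estimates, intervals far apart.
disparate = [(-0.30, -0.40, -0.20), (-0.25, -0.33, -0.17),
             (0.20, 0.12, 0.28), (0.28, 0.18, 0.38)]

# Hypothetical values: similar estimates, intervals largely overlapping.
similar = [(-0.10, -0.25, 0.05), (-0.15, -0.30, 0.00),
           (-0.08, -0.20, 0.04), (-0.12, -0.22, -0.02)]

for label, studies in (("disparate", disparate), ("similar", similar)):
    print(label, round(estimate_spread(studies), 2), common_overlap(studies))
```

In the first set there is no region common to all intervals and the point estimates are far apart, so pooling would be questionable; in the second set both checks point toward pooling.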
The pooled estimate at the bottom of Fig. 3 is centred on
the midpoint of the area of overlap of the confidence intervals
around the estimates of the individual trials. It provides our
best guess as to the underlying treatment effect. Of course, we
cannot actually know the “truth” and must be content with
potentially misleading estimates. The intent of a meta-analysis
is to include enough studies to narrow the confidence interval
around the resulting pooled estimate sufficiently to provide estimates of benefit for our patients in which we can be confident. Thus, our best estimate of the truth will lie in the area of
overlap among the confidence intervals around the point estimates of treatment effect presented in the primary studies.
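To see how pooling narrows the confidence interval, the sketch below compares the midpoint of the shared overlap region with a pooled estimate. The text does not say how the pooled estimate in Fig. 3 was calculated; inverse-variance (fixed-effect) weighting is used here as an assumption, as is the recovery of each standard error from a symmetric 95% confidence interval. The study values are invented for illustration.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval

# Hypothetical studies: (point_estimate, ci_lower, ci_upper) on an additive
# scale (for example a risk difference), where 0 means "no difference".
studies = [(-0.10, -0.25, 0.05), (-0.15, -0.30, 0.00),
           (-0.08, -0.20, 0.04), (-0.12, -0.22, -0.02)]


def overlap_midpoint(studies):
    """Midpoint of the region shared by all intervals, or None if none exists."""
    lower = max(s[1] for s in studies)
    upper = min(s[2] for s in studies)
    if lower > upper:
        return None
    return (lower + upper) / 2


def pool_inverse_variance(studies):
    """Fixed-effect pooled estimate with its 95% confidence interval."""
    weights = []
    for _, lower, upper in studies:
        # Assumes each interval is symmetric: half-width = 1.96 * SE.
        se = (upper - lower) / (2 * Z_95)
        weights.append(1.0 / se ** 2)
    total = sum(weights)
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled - Z_95 * pooled_se, pooled + Z_95 * pooled_se


midpoint = overlap_midpoint(studies)
pooled, pooled_lower, pooled_upper = pool_inverse_variance(studies)
narrowest = min(upper - lower for _, lower, upper in studies)

print("midpoint of overlap:", midpoint)
print("pooled estimate and 95% CI:", pooled, pooled_lower, pooled_upper)
print("pooled CI narrower than any single study:",
      (pooled_upper - pooled_lower) < narrowest)
```

For these invented values the pooled estimate falls inside the overlap region and close to its midpoint, and its confidence interval is narrower than that of any single study. That agreement is a property of this example, not a general guarantee.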
What is the clinician to do when presented with results
such as those in Fig. 1A? If the investigators have done a
good job of planning and executing the meta-analysis, they
will provide some assistance.⁶ Before examining the study
results in detail, they will have generated a priori hypotheses
to explain the heterogeneity in magnitude of effect across
studies that they are liable to encounter. These hypotheses
will include differences in patients (effects may be larger in
sicker patients), in interventions (larger doses may result in
larger effects), in outcomes (longer follow-up may diminish
the magnitude of effect) and in study design (methodologically weaker studies may generate larger effects).
The investigators will then have examined the extent to
which these hypotheses can explain the differences in magnitude of effect across studies. These subgroup analyses
may be misleading, but if they meet 7 criteria suggested
elsewhere¹⁰ (see Box 2), they may provide credible and satisfying explanations for the variability in results.
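As a rough illustration of how an a priori hypothesis might account for heterogeneity, the sketch below groups invented studies by a hypothetical characteristic (dose) and checks whether the confidence intervals share a common region overall and within each subgroup. The field names, values and the overlap check are all assumptions made for this example; the check does not replace the criteria in Box 2.

```python
from collections import defaultdict

# Hypothetical studies; "dose" stands in for any a priori explanatory factor.
# Estimates are on a scale where 0 means "no difference".
studies = [
    {"name": "A", "dose": "low", "estimate": -0.05, "ci": (-0.15, 0.05)},
    {"name": "B", "dose": "low", "estimate": -0.08, "ci": (-0.20, 0.04)},
    {"name": "C", "dose": "high", "estimate": -0.40, "ci": (-0.52, -0.28)},
    {"name": "D", "dose": "high", "estimate": -0.45, "ci": (-0.60, -0.30)},
]


def shared_interval(group):
    """Region common to every confidence interval in the group, else None."""
    lower = max(s["ci"][0] for s in group)
    upper = min(s["ci"][1] for s in group)
    return (lower, upper) if lower <= upper else None


# Split the studies according to the hypothesised explanatory factor.
by_dose = defaultdict(list)
for study in studies:
    by_dose[study["dose"]].append(study)

overall = shared_interval(studies)
within = {dose: shared_interval(group) for dose, group in by_dose.items()}

print("all studies:", overall)
for dose, interval in within.items():
    print(dose, "dose subgroup:", interval)
```

Here the 4 studies taken together share no common region, but the studies within each dose subgroup do, which is the pattern one would expect if dose explained the variability in results.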
The bottom line
• Readers can decide for themselves whether there is
clinically important heterogeneity among the results of
primary studies through a qualitative assessment of the
graphic results. This assessment is based on the amount
Tips for EBM learners: heterogeneity
Box 2: Questions to ask when evaluating a subgroup analysis in a …¹⁰