Testing potentiates new learning across a retention interval
PSYC1001 Research Report Assignment 2019 Results INFORMATION
NB: This is NOT a results section. It is not APA formatted, nor is it easy to understand (for a reader/marker).
You need to write a results section for your research report, but in the correct APA format (sentences, table/
graph), so DO NOT copy any section of this document into your research report. Refer to the third Report
Writing Module for assistance.
Mean Recall Rate/Proportion of words retrieved from LIST 3 (number of words recalled
out of total words, e.g. with a list size of 12, if 6 words were recalled the score would be 0.500)
A: Retrieval M=0.6152
B: Restudy (control) M=0.3505
C: List Discrimination M=0.487
D: Categorical Judgements M=0.3047
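The scoring rule described above is a simple division. A minimal Python sketch follows; the function name `recall_proportion` is an illustrative assumption and is not part of the assignment materials.

```python
def recall_proportion(words_recalled: int, list_size: int) -> float:
    """Proportion of list words recalled, e.g. 6 of 12 words gives 0.5."""
    if list_size <= 0:
        raise ValueError("list_size must be positive")
    if not 0 <= words_recalled <= list_size:
        raise ValueError("words_recalled must be between 0 and list_size")
    return words_recalled / list_size


# The worked example from the text: 6 words recalled from a 12-word list.
print(f"{recall_proportion(6, 12):.3f}")  # prints 0.500
```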
Statistical Comparisons
A vs B: p=0.00001
C vs B: p=0.022
D vs B: p=0.44
Optional
A vs C: p=0.031
Redundant
A vs D: p=.00001
C vs D: p=0.003
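As a rough aid for turning the raw p-values above into APA-style statements, the sketch below formats a p-value without a leading zero and checks it against a significance threshold. The threshold of .05 is an assumed convention (the source does not state one), and the names `format_p_apa` and `is_significant` are hypothetical helpers.

```python
ALPHA = 0.05  # assumed conventional threshold; not stated in the source

# p-values for the comparisons against the restudy control (B), from the list above
comparisons = {"A vs B": 0.00001, "C vs B": 0.022, "D vs B": 0.44}


def format_p_apa(p: float) -> str:
    """Format a p-value with no leading zero; very small values become 'p < .001'."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def is_significant(p: float, alpha: float = ALPHA) -> bool:
    """True when p falls below the chosen threshold."""
    return p < alpha


for label, p in comparisons.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{label}: {format_p_apa(p)} ({verdict})")
```

The formatted values still need to be embedded in full sentences that state the direction of each effect.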
First Year Psychology Report Marking Rubric and Marking Criteria
Please refer to the tutorial materials and writing modules for more instructions on writing your report.
Rubric item 60 Satisfactory Add marks for Subtract marks for Weight
Abstract (weight: 10%)
Satisfactory (60): Abstract is 100-150 words and consists of four or five concise sentences which describe in turn: Background to the study, What was done, What was found, What it means. Content of the abstract accurately describes the content of the report and all its subsections including its conclusion in an appropriate style. Details concerning the study are correct.
Add marks for: Abstract is easy to understand (terminology defined or excluded). Abstract is efficient but also contains sufficient (correct) detail about the study. Abstract gives a good sense of the contribution the study makes to the field.
Subtract marks for: Abstract is impossible or difficult to understand (uses condition labels or terminology which is only discussed in the report). Abstract contains too much detail (e.g. means or p-values) or too little detail (vague statements about findings). Abstract is too long or too short or is missing a key component (background or method or findings or impact).
Introduction Literature Use (weight: 10%)
Satisfactory (60): Introduction adequately introduces and describes previous research in sufficient detail to make its relevance to the current research obvious and to place the current research in the appropriate context.
Add marks for: The details (methods and results) of previous research are presented, contrasted, and integrated into the argument. Different papers, their methods and findings are integrated together.
Subtract marks for: Few (if any) details of prior research are mentioned – only findings (i.e. only information from abstracts and conclusions is used) which means integration is at a crude level. Different papers are described sequentially only.
Introduction case for new study (weight: 10%)
Satisfactory (60): Introduction describes and justifies the current study in the context of a need arising from the previous literature or a need for more information.
Add marks for: Previous literature is chosen and described in a manner which makes the current study sound critically important and distinct from prior research.
Subtract marks for: Less clear why the current study needed to be done or how it will answer the questions posed, or how it even differs from prior research. Low level of detail in description or justification of study.
Introduction hypotheses (weight: 10%)
Satisfactory (60): The introduction finishes with clear predictions for the outcome of the study – and justifications for the predictions based on previous findings or theories.
Add marks for: Rather than just mentioning prior conclusions, specific prior results and their context (i.e. different methods) are used to argue for precise outcomes. Multiple prior results (from different studies) might be used to argue for a pattern of results.
Subtract marks for: Justifications of hypotheses are poor because they are at the wrong level of detail. Citations are used in place of logic and method/results details, as if to say ‘this was found before so will be found again’ – emphasising the lack of originality of the study.
Results APA sentences (weight: 10%)
Satisfactory (60): Results section accurately describes key findings in full sentences which stand independently. A reader is able to accurately determine the basic uninterpreted meaning, direction and statistical significance (p-values used appropriately) of all key findings without referring back to the method section or the graph/table.
Add marks for: Results writing is concise – no more sentences than are needed. Results are presented in a manner which makes them easier to relate to hypotheses. Use of p-values and APA language is perfect.
Subtract marks for: Results section much longer than it needs to be (e.g. sentences describing results which do not cite p-values). Incorrect use of p-values. Sentences impossible to understand without method (e.g. use of Condition labels). Direction of effects absent or unclear.
Results graph or table (weight: 10%)
Satisfactory (60): Results section graphs or tabulates key findings in a way which makes them easier to understand. Table or graph is APA format, clearly titled; axes or columns are clearly titled; a Figure or Table caption describes the content accurately; the appropriate kind of table or graph is used. The Table or Graph is referred to in the text and corresponds to the way results are described.
Add marks for: Choice and formatting of table or graph emphasises key results and is a good match for the hypotheses and later discussion. No errors at all in formatting.
Subtract marks for: Table or graph is wrong (e.g. a graph or table of p-values). Table or graph does not make results easier to grasp (does not clearly show findings relevant to hypotheses), is not referred to in the text, axes aren’t labelled, is not APA formatted.
Discussion results literature integration (weight: 10%)
Satisfactory (60): Results are discussed in relation to the previous literature (as reviewed in the introduction) in sufficient detail to make it clear what the contribution of the study has been to the field.
Add marks for: Using the discrepancies in methods between the current and prior studies, a strong case is made for the importance of the current study and its findings.
Subtract marks for: Citations are used in place of logic and method/results details, as if to say ‘this was found before so has been found again’ – emphasising the lack of originality of the study.
Discussion limitations, consequences for future research (weight: 10%)
Satisfactory (60): The findings and conclusions of the study are placed in the context of the field. The reach and impact of the findings are qualified with appropriate humility without diminishing emphasis on their importance.
Add marks for: Distinctions in methods between the current and previous studies are used to justify precise qualifications about the impact of results. Where design shortcomings are discussed, solutions are offered in the form of new design features.
Subtract marks for: Less detailed integration of findings/methods with existing literature means that limitations discussed are generic (e.g. sample or sample size), and solutions are just as generic (e.g. get a different/bigger sample). Criticisms are so strong or crude as to undermine the purpose of the study entirely. Criticisms don’t make sense or are not ‘followed through’ for their actual impact on results (e.g. ‘the room was noisy…’)
APA Formatting (weight: 20%)
See table below for details.
The correct report STRUCTURE is used, with all sections present and in the correct order. The correct FONT is used throughout and sections are ALIGNED correctly. Correct HEADINGS are used. Formal APA style LANGUAGE is used throughout. SPACING is correct and INDENTS are correct (for both the main and references sections). There is no 60 anchor for this rubric item – a full score is attainable with all components correct.
APA formatting
Component Instructions Marks
/20
STRUCTURE (4 marks)
Abstract [4 or 5 sentences; concise description of all sections of report; 100-150 words]
Introduction paragraph 1 [Summary of argument/approach; key definitions; overview of field]
Introduction paragraph 2 [Review of key literature]
Introduction paragraph 3 [Description and justification of study, promotion of its novelty/importance]
Introduction paragraph 4 [Derivation and justification of hypotheses]
Results [Precise description of all key results]
Results graph or table [APA formatted table or graph]
Discussion paragraph 1 [Review of all key results]
Discussion paragraphs 2-3 [Integration of results with prior literature]
Discussion paragraphs 3-4 [Qualification of results and their impact]
References section [All cited works appear here in alphabetical order]
NB: Content of these sections is assessed with other scores. This score is just for ‘are these sections present and in the correct order?’ If the clear purpose of each paragraph and order is present, there is no penalty from slightly different paragraph structure (e.g. no hypotheses = no marks; hypotheses blended with study description = no penalty).
FONT and ALIGNMENT (4 marks)
Times New Roman 12 point font throughout.
All sections left aligned or justified. No sections are centre aligned.
HEADINGS (4 marks)
Abstract is titled ‘Abstract’, Introduction has no title, Results labelled ‘Results’, Discussion labelled ‘Discussion’, References labelled ‘References’.
SPACING and INDENTS (4 marks)
Double spaced throughout. NO additional space between paragraphs (turn this off in WORD).
No indent for abstract. For introduction, results and discussion, first line of each paragraph indented.
References section has a reverse indent.
LANGUAGE (4 marks)
Formal APA language, no use of first person or colloquial language. Does not include English errors or grammatical errors (to the extent language is uninterpretable, other rubric items will be affected).
Marks /4 are all or none for each section. E.g. indents are correct everywhere except the references section = 0/4
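The all-or-none rule can be illustrated with a short Python sketch. The function name `apa_formatting_score` and the dictionary layout are illustrative assumptions; only the five component names and the 4 marks per component come from the table above.

```python
COMPONENTS = (
    "STRUCTURE",
    "FONT and ALIGNMENT",
    "HEADINGS",
    "SPACING and INDENTS",
    "LANGUAGE",
)
MARKS_PER_COMPONENT = 4


def apa_formatting_score(fully_correct: dict[str, bool]) -> int:
    """Return marks out of 20; each component earns all 4 marks or none."""
    missing = set(COMPONENTS) - set(fully_correct)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(MARKS_PER_COMPONENT for name in COMPONENTS if fully_correct[name])


# Example from the text: indents correct everywhere except the references section.
report = {name: True for name in COMPONENTS}
report["SPACING and INDENTS"] = False
print(apa_formatting_score(report))  # prints 16
```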
Word count rules
Research report is to be 1150 words.
Counted in the word limit: Abstract, Introduction, Results, Discussion, All figure captions, All figure text, All quotes (do not use), All in text
citations, All endnotes and footnotes (do not use), All Headings, anything else you want the marker to consider for marks except the
references section.
Not counted in the word limit: References section.
If in doubt: Select all text from the beginning of your abstract to the end of your discussion. That is your word count.
Margin for error: 5% (1092 ‐ 1208 words)
Penalty for exceeding: Marker will not read or consider for marks any text after 1208 words is reached.
Penalty for below minimum word count: Not considered a serious attempt.
You should not use quotes of any kind (It’s like saying to the marker: “I did not understand this” or “I couldn’t be bothered putting this into my
own words”). If you use so many quotes that your actual written contribution is greatly reduced, then your report will not be considered a
serious attempt regardless of word count.
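The word count limits above follow from the 1150-word target and the 5% margin. The sketch below reproduces the 1092 to 1208 range under the assumption that the lower bound is rounded down and the upper bound is rounded up; the whitespace-based `count_words` helper is also an assumption and may not match a word processor's count exactly.

```python
import math

TARGET_WORDS = 1150
MARGIN_PERCENT = 5


def word_count_bounds(target: int = TARGET_WORDS,
                      margin_percent: int = MARGIN_PERCENT) -> tuple[int, int]:
    """Lower bound rounded down and upper bound rounded up (assumed rounding)."""
    lower = target * (100 - margin_percent) // 100
    upper = math.ceil(target * (100 + margin_percent) / 100)
    return lower, upper


def count_words(text: str) -> int:
    """Count whitespace-separated tokens; a word processor may count differently."""
    return len(text.split())


def within_limits(text: str) -> bool:
    """True when the text (abstract through discussion) falls inside the range."""
    lower, upper = word_count_bounds()
    return lower <= count_words(text) <= upper


print(word_count_bounds())  # prints (1092, 1208)
```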
Joshua W. Whiffen and Jeffrey D. Karpicke
Purdue University
The episodic context account of retrieval-based learning proposes that retrieval enhances subsequent
retention because people must think back to and reinstate a prior learning context. Three experiments
directly tested this central assumption of the context account. Subjects studied word lists and then either
restudied the words under intentional learning conditions or made list discrimination judgments by
indicating which list each word had occurred in originally. Subjects in both conditions experienced all
items for the same amount of time, but subjects in the list discrimination condition were required to
retrieve details about the original episodic context in which the words had occurred. Making initial list
discrimination judgments consistently enhanced subsequent free recall relative to restudying the words.
Analyses of recall organization and retrieval strategies on the final test showed that retrieval practice
enhanced temporal organization during final recall. Semantic encoding tasks also enhanced retention
relative to restudying but did so by promoting semantic organization and semantically based retrieval
strategies during final recall. The results support the episodic context account of retrieval-based learning.
Keywords: memory, retrieval practice, testing effect
A wealth of recent research has examined the effects of retrieval
practice on learning. When people retrieve items on an initial test,
the act of initial retrieval enhances subsequent retention. Thus, the
act of retrieval alters memory, making retrieved items more re-
trievable in the future. Retrieval practice effects are robust and
have been explored with a variety of materials in a range of
settings (for recent reviews, see Nunes & Karpicke, 2015; Row-
land, 2014). However, there is still considerable room for progress
in understanding the mechanisms of retrieval-based learning.
One recent theory of retrieval-based learning is the episodic
context account (Karpicke, Lehman, & Aue, 2014; Lehman,
Smith, & Karpicke, 2014), which explains retrieval practice effects
on the basis of four central assumptions. First, people encode
information about items and the temporal/episodic context in
which those items occurred (Howard & Kahana, 2002). Second,
during retrieval, people attempt to reinstate the episodic context
associated with an item as part of a memory search process
(Lehman & Malmberg, 2013). Third, when an item is successfully
retrieved, the context representation associated with that item is
updated to include features of the original study context and
features of the present test context. Finally, when people attempt to
retrieve items again on a later test, the updated context represen-
tations aid in recovery of those items, and memory performance is
improved.
The context theory can account for several key findings in the
retrieval practice literature. For example, one consistent finding is
that spaced retrieval produces better retention than does massed
retrieval (Roediger & Karpicke, 2011). The context account pro-
poses that temporal context will have changed more during a
spaced repetition than during a massed one, so spaced retrieval
may require a greater degree of context reinstatement relative to
massed retrieval. Spaced retrieval may also yield updated context
representations that are more distinctive than those produced by
massed retrieval (Karpicke et al., 2014). The context account also
helps explain the positive effects of “effortful” initial retrieval
tasks. Specifically, free recall tests tend to produce larger retrieval
practice effects than do recognition tests (Glover, 1989); practicing
retrieval with weakly associated cues produces larger effects rel-
ative to practicing retrieval with strong associates (Carpenter,
2009); and initial recall with only the first letter of a target as a cue
produces larger retrieval practice effects than does initial recall
with three letters of the target (Carpenter & DeLosh, 2006). In all
cases, the conditions that produce larger retrieval practice effects
(freely recalling, recalling with weak cues, and recalling with
fewer letter cues) are ones that require learners to engage in greater
degrees of context reinstatement during initial retrieval.
The episodic context account also helps explain the role of
retrieval mode in retrieval practice effects. Retrieval mode refers to
the cognitive state in which people intentionally think back to a
particular place and time when an event occurred (Tulving, 1983).
Experiments by Karpicke and Zaromb (2010) established the im-
portance of retrieval mode for retrieval-based learning. In those
experiments, subjects studied a list of target words (e.g., love) and
This article was published Online First January 12, 2017.
Joshua W. Whiffen and Jeffrey D. Karpicke, Department of Psychological
Sciences, Purdue University.
This research was supported in part by grants from the National Science
Foundation (DRL-1149363 and DUE-1245476) and the Institute of Edu-
cation Sciences in the U.S. Department of Education (R305A110903 and
R305A150546). The opinions expressed are those of the authors and do not
represent the views of the National Science Foundation, the Institute of
Education Sciences, or the U.S. Department of Education. We thank Nola
Daley and Nick Counger for help collecting the data, Philip Grimaldi for
help with computer programming, and James Nairne and Greg Francis for
comments.
Correspondence concerning this article should be addressed to Jef-
frey D. Karpicke, Department of Psychological Sciences, Purdue Uni-
versity, 703 Third Street, West Lafayette, IN 47907-2081. E-mail:
karpicke@purdue.edu
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Journal of Experimental Psychology: Learning, Memory, and Cognition
2017, Vol. 43, No. 7, 1036–1046
© 2017 American Psychological Association 0278-7393/17/$12.00 http://dx.doi.org/10.1037/xlm0000379
then restudied the targets paired with related cues (e.g., heart-love)
or saw cues and fragments of the targets (e.g., heart-l_v_). In one
condition, subjects were told to generate words that would com-
plete each fragment but were not told to think back to the study
phase. In a second condition, subjects were placed in an episodic
retrieval mode: They were told to think back to the study phase and
complete the fragments with words they had studied. On final free
recall and item recognition tests, both fragment-completion con-
ditions tended to outperform the restudy condition. Most impor-
tantly, intentionally retrieving the target words produced larger
gains on the final test relative to generating the target words
without recollecting the study episode (see too Pu & Tse, 2014).
Thus, reinstating the original episodic context during the practice
phase enhanced subsequent retention.
Although the episodic context account helps explain several key
findings about retrieval practice, few studies have directly tested
predictions derived from the account. The present experiments
examined a central prediction: With all else held constant, if
people experience items and are required to think back to an
original study episode, the act of doing so should enhance subse-
quent retention relative to experiencing the items but not thinking
back to a study episode. The present experiments accomplished
this by using a list discrimination task. To implement retrieval
practice, subjects were shown a list of words and indicated which
list the word had occurred in during the first phase of the exper-
iment. Prior studies have examined the effects of initial retrieval
practice on later list discrimination performance (e.g., Brewer,
Marsh, Meeks, Clark-Foos, & Hicks, 2010; Chan & McDermott,
2007; Verkoeijen, Tabbers, & Verhage, 2011). Here, list discrim-
ination was used as a retrieval practice task that required subjects
to think back to and reinstate the original episodic context.
The list discrimination task used in the present experiments
circumvents a methodological problem that often exists in retrieval
practice research. In many experiments, while subjects in restudy
conditions reexperience the entire set of items, subjects in retrieval
practice conditions reexperience only the items they are able to
recall. Thus, reexposure to items is not equated in restudy and
retrieval practice conditions. For example, in Karpicke and
Zaromb’s (2010) experiments, subjects recalled approximately
70% to 75% of the target words during initial retrieval practice,
whereas they reexperienced 100% of the targets in the restudy
condition (see Karpicke et al., 2014, for further discussion of this
issue). In the present experiments, subjects in all conditions reex-
perienced all items for the same amount of time. The only differ-
ence between the restudy and retrieval practice conditions was
whether subjects were told to restudy the words or whether they
were required to recollect the study episode by making list dis-
crimination judgments.
The three experiments reported here used the same general
procedure. First, subjects studied two short lists of words. Next,
they were re-presented with the words from both lists mixed together.
In a restudy condition, subjects were only told to restudy
the words, whereas in a list discrimination condition, subjects
indicated whether the words occurred in list 1 or 2. The relative
effects of restudying or making list discrimination judgments were
assessed on a final free recall test. The general prediction was that
making list discrimination judgments would enhance final recall
relative to restudying, because the list discrimination task required
subjects to think back to the study episode and recollect informa-
tion about the temporal occurrence of items.
Experiments 2 and 3 examined the effects of initial list discrim-
ination on subsequent recall and also included semantic encoding
conditions in which subjects made pleasantness ratings or category
judgments, respectively, when they restudied the words. On the
basis of vast prior research, elaborative encoding was expected to
enhance recall relative to restudying. However, patterns of final
recall were expected to differ in the list discrimination and elab-
orative study conditions, reflecting differences in organizational
output strategies used during final recall.
The episodic context account predicts that retrieval practice
should produce patterns of recall output that differ from those in
restudy and elaborative encoding conditions. Specifically, if con-
text representations are updated during retrieval practice and sub-
jects use context to guide retrieval during subsequent recall, then
patterns of final recall output should show greater organization
around temporal dimensions after subjects have practiced retrieval
relative to when they restudied or made semantic judgments. The
present experiments explored several aspects of organization and
memory search dynamics during free recall. Measures of cluster-
ing were used to assess the extent to which recall was organized
around the original study order. Measures of temporal and seman-
tic factors, following Sederberg, Miller, Howard, and Kahana
(2010), examined the extent to which item-to-item transitions
during free recall followed the original temporal order of words or
the semantic relatedness of words, respectively. Finally, an addi-
tional analysis examined the dynamics of how people searched
memory during final recall, based on the idea that people forage
through memory representations in ways that are similar to how
animals forage in physical spaces (see Hills, Jones, & Todd, 2012;
Hills, Todd, & Jones, 2015).
Experiment 1
The purpose of Experiment 1 was to test two predictions based
on the episodic context account. First, making temporal judgments
about when words occurred in a study list should enhance retention
relative to restudying the words. In Experiment 1, subjects studied
a list of words, restudied or made list discrimination judgments
about the words, then took a final free recall test. The subjects in
both conditions reexperienced the words, but those in the list
discrimination condition were required to think back to the original
study episode and remember when the word had occurred. The
effects of restudying the words or making temporal judgments
were assessed on a final free recall test. The second prediction was
that final recall would exhibit greater organization around the
original temporal order of the items in the list discrimination
condition relative to the restudy condition, because retrieval prac-
tice in the list discrimination condition would result in the rein-
statement and subsequent updating of context. Analyses of tem-
poral clustering, temporal and semantic factors, and foraging
patterns during final recall were carried out to examine this pre-
diction.
Method
Subjects. Sixty Purdue University undergraduates partici-
pated in Experiment 1 in exchange for course credit.
Materials. Thirty-six medium frequency, medium concrete-
ness words were selected from the Clark and Paivio (2004) norms.
The words were divided into six lists of six words. The lists were
then paired to form three study blocks within the learning phase
(lists 1–2, lists 3–4, and lists 5–6 were study blocks 1, 2, and 3,
respectively). The words within each study block were equated for
concreteness, imagery, and frequency, and the order of the study
blocks was counterbalanced across subjects.
Design. Experiment 1 used a between-subjects design. There
were two conditions, list discrimination and restudy, and 30 sub-
jects were assigned to each condition.
Procedure. The subjects were tested in small groups of one to
four people. At the beginning of the experiment, subjects were told
that they would study several short lists of words and that their
memory for the words would be tested at the end of the experi-
ment. The study phase consisted of three study blocks. Within each
study block, subjects studied a list of six words, performed a brief
distracter task, studied a second list of six words, performed the
distracter task again, and then reexperienced the 12 words in either
a restudy or list discrimination task. In study periods, words were
presented on a computer screen one at a time at a 3-s rate with a
500-ms interstimulus interval. In the distracter task, subjects spent
30 s solving one- or two-digit addition problems. The problems
were shown one at a time on the computer, and subjects typed their
answers and pressed “Enter” to advance to the next problem. After
studying two lists, subjects were shown the 12 words from both
lists mixed together, one at a time at a 3-s rate with a 500-ms
interstimulus interval. At this point the critical manipulation oc-
curred. In the restudy condition, subjects were instructed to restudy
the list of words. In the list discrimination condition, subjects were
told that they had 3 seconds to indicate whether each word was
from list 1 or list 2 by clicking one of two buttons (labeled “List
1” and “List 2”) shown on the computer screen. The words
remained on the screen for 3 s regardless of when subjects made
their responses, and the computer program automatically advanced
to the next word after 3 s even if a response had not been made.
Thus, in both conditions, subjects reexperienced all 12 words for
the same amount of time; the difference was that one group
restudied the words, whereas the other group was required to think
back to the earlier part of the experiment and decide whether each
word occurred in the first or second list. After completing the
restudy or list discrimination task, subjects completed another 30
s of the distracter task and then advanced to the next part of the
experiment. This procedure, wherein subjects studied two lists and
then either restudied or made list discrimination judgments, was
repeated for the other two study blocks (lists 3–4 and lists 5–6),
for a total of three study blocks in the learning phase.
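The timing of one study block can be laid out as a simple event list. The sketch below is an illustration only: it assumes the 500-ms interstimulus interval follows every word, and the function and constant names are hypothetical. It shows that the two conditions differ in instructions but not in exposure time.

```python
WORD_S = 3.0          # each word shown for 3 s
ISI_S = 0.5           # 500-ms interstimulus interval (assumed to follow every word)
DISTRACTER_S = 30.0   # arithmetic distracter task
WORDS_PER_LIST = 6
N_BLOCKS = 3


def study_block_events(condition: str) -> list[tuple[str, float]]:
    """Return (event, duration in seconds) pairs for one study block."""
    if condition not in {"restudy", "list discrimination"}:
        raise ValueError("condition must be 'restudy' or 'list discrimination'")
    list_s = WORDS_PER_LIST * (WORD_S + ISI_S)
    return [
        ("study first list", list_s),
        ("distracter", DISTRACTER_S),
        ("study second list", list_s),
        ("distracter", DISTRACTER_S),
        (f"{condition} of all 12 words", 2 * list_s),
        ("distracter", DISTRACTER_S),
    ]


def block_duration(condition: str) -> float:
    """Total duration of one study block in seconds."""
    return sum(duration for _, duration in study_block_events(condition))


# Both conditions take the same time; only the task during reexposure differs.
for condition in ("restudy", "list discrimination"):
    print(condition, block_duration(condition) * N_BLOCKS)
```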
At the end of the learning phase, subjects completed an
additional 1 min of the distracter task and then took a final free
recall test. On the final test, subjects were given 5 min to recall
as many words as possible from the learning phase, in any
order. Subjects typed their responses into a response box on the
computer. They were instructed to press the “Enter” key after
they had typed each response, which added that response to a
list of their responses displayed on the computer screen. At the
end of the experiment the subjects were debriefed and thanked
for their participation.
Results
List discrimination performance. Overall, subjects entered
responses on 99% of trials (in total, there were 1080 trials [30
subjects × 36 trials per subject], and 1065 responses were recorded).
The mean proportion correct on the list discrimination
task was .86. Response times were measured as the time between
the onset of the word and the subject’s mouse click. The average
response time for correct responses was 1.6 s. Table 1 shows the
mean proportion correct and mean response times across study
blocks in all three experiments. In Experiment 1, list discrimination
performance did not change much across study blocks, F(2, 58) = 2.45,
p = .10, η² = 0.08, and response times tended to become slightly faster
across study blocks, F(2, 58) = 3.11, p = .06, η² = 0.10.
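The response rate and effect sizes in this paragraph can be checked with a few lines of Python. This sketch assumes the reported effect size is partial eta squared computed from the F ratio and its degrees of freedom; the source does not say which variant was used, and the function names are hypothetical.

```python
def response_rate(responses: int, trials: int) -> float:
    """Proportion of trials on which a response was recorded."""
    return responses / trials


def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio (assumed effect size measure)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


trials = 30 * 36  # 30 subjects, 36 trials each
print(f"{response_rate(1065, trials):.2f}")         # prints 0.99
print(f"{partial_eta_squared(2.45, 2, 58):.2f}")    # prints 0.08
print(f"{partial_eta_squared(3.11, 2, 58):.2f}")    # prints 0.10
```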
Final free recall. The key results of Experiment 1 are the
proportions of words recalled on the final free recall test, shown in
the left panel of Figure 1. Subjects in the list discrimination
condition recalled more items on the final test than did subjects in
the restudy group (.48 vs. .38), t(58) = 2.41, d = 0.62, 95% CI
[0.10, 1.14]. Thus, making a list discrimination judgment, which
required people to think back to and retrieve the original temporal
context in which a word occurred, produced a final recall advantage of
10 percentage points relative to restudying.
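The effect size and interval reported for this comparison can be approximated from the t statistic and the group sizes. The sketch below uses a standard conversion from t to Cohen's d and a large-sample normal approximation for the interval. The source does not state how its interval was computed, so this is an assumed method that happens to reproduce the reported values; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def cohens_d_from_t(t: float, n1: int, n2: int) -> float:
    """Cohen's d for two independent groups, recovered from the t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def d_confidence_interval(d: float, n1: int, n2: int,
                          level: float = 0.95) -> tuple[float, float]:
    """Approximate interval for d using a large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


d = cohens_d_from_t(2.41, 30, 30)
low, high = d_confidence_interval(d, 30, 30)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")  # d = 0.62, 95% CI [0.10, 1.14]
```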
Table 2 shows an analysis of the relationship between initial list
discrimination performance and final free recall. Following Tulv-
ing’s (1964) convention for examining the fate of individual items
across two tests, C1 refers to items correctly identified on the initial
list discrimination test and N1 refers to items that were not correct
on the initial list discrimination test. C2 refers to items recalled on
the final free recall test and N2 refers to items not recalled on the
final test (see also Karpicke & Zaromb, 2010). This analysis is
correlational and subject to item-selection effects. Nevertheless,
the results indicate that when items were not correctly identified on
the list discrimination test (N1), it was unlikely that those items
would then be recalled on the final recall test (the joint probability
was .05 in Experiment 1). When items were correctly identified on
the list discrimination test (C1), they were much more likely to be
recalled on the final recall test (.41 in Experiment 1).
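The fate-of-items analysis amounts to tabulating each item into one of four cells and dividing by the number of items. A minimal sketch follows; the function name and the example data are hypothetical and are not taken from Table 2.

```python
from collections import Counter


def joint_probabilities(initial_correct: list[bool],
                        final_recalled: list[bool]) -> dict[str, float]:
    """Joint probabilities of initial outcome (C1/N1) and final outcome (C2/N2)."""
    if len(initial_correct) != len(final_recalled) or not initial_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = Counter(
        ("C1" if correct else "N1") + ("C2" if recalled else "N2")
        for correct, recalled in zip(initial_correct, final_recalled)
    )
    n_items = len(initial_correct)
    return {cell: counts[cell] / n_items
            for cell in ("C1C2", "C1N2", "N1C2", "N1N2")}


# Hypothetical data for four items: three identified correctly, two later recalled.
initial = [True, True, True, False]
final = [True, False, True, False]
print(joint_probabilities(initial, final))
```

Because the four cells partition the items, the probabilities sum to 1.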
Temporal clustering during final recall. Clustering was
measured with adjusted ratio of clustering (ARC) scores (Roenker,

Table 1
Mean Proportion Correct and Response Time (in Milliseconds) on the List Discrimination Tasks in All Experiments

Experiment        Proportion correct    Response time
Experiment 1
  Block 1         .87 (.03)             1722 (60)
  Block 2         .89 (.03)             1501 (63)
  Block 3         .82 (.03)             1581 (78)
Experiment 2
  Block 1         .88 (.02)             1686 (57)
  Block 2         .84 (.02)             1708 (71)
  Block 3         .85 (.02)             1676 (64)
Experiment 3
  Block 1         .86 (.02)             1810 (70)
  Block 2         .76 (.03)             1740 (66)
  Block 3         .82 (.03)             1620 (58)
Note. Standard errors are in parentheses.

Thompson, & Brown, 1971). ARC scores range from −1 to 1,
where 0 represents chance clustering and 1 represents perfect
clustering around a dimension (negative scores are considered
uninterpretable; Murphy & Puff, 1982). ARC scores are typically
calculated to measure the extent to which a person’s recall output
is organized around semantic (e.g., taxonomic) categories. Here,
ARC scores were used to assess how well free recall was orga-
nized around study block (1, 2, or 3). The right panel in Figure 1
shows the mean temporal clustering scores. Subjects in the list
discrimination condition had higher temporal clustering scores
than did subjects in the restudy condition (.38 vs. .25), t(58) =
1.77, d = 0.46 [−0.06, 0.97]. The temporal clustering scores
indicate that subjects in the list discrimination condition organized
their recall around the original study order more than subjects in
the restudy condition did, which supports the idea that these
subjects used episodic context information to guide their output
during recall.
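As an illustration of the clustering measure, the sketch below computes an ARC score for one recall protocol using the formula usually attributed to Roenker et al. (1971): ARC = (R − E(R)) / (max R − E(R)), where R is the number of adjacent same-category pairs in the output, E(R) = Σ n_i² / N − 1, and max R = N − k (N items recalled, n_i items recalled from category i, k categories represented in recall). Here the "category" is the study block. The function name and the example sequences are hypothetical, and this is a sketch under those assumptions rather than the authors' code.

```python
from collections import Counter


def arc_score(recalled_blocks):
    """Adjusted ratio of clustering for one recall protocol.

    recalled_blocks: study block (e.g., 1, 2, or 3) of each recalled word,
    in output order. Returns None when the score is undefined.
    """
    n_total = len(recalled_blocks)
    if n_total == 0:
        return None
    counts = Counter(recalled_blocks)
    # R: observed number of adjacent pairs drawn from the same block
    observed = sum(a == b for a, b in zip(recalled_blocks, recalled_blocks[1:]))
    # E(R): repetitions expected by chance
    expected = sum(n * n for n in counts.values()) / n_total - 1
    # max R: repetitions under perfect clustering
    maximum = n_total - len(counts)
    if maximum == expected:
        return None  # e.g., every recalled word came from a single block
    return (observed - expected) / (maximum - expected)


perfect = arc_score([1, 1, 1, 2, 2, 2, 3, 3, 3])   # fully blocked output
chance = arc_score([1, 1, 2, 3, 2, 2, 3, 1, 3])    # two repetitions, as expected by chance
```

With these inputs the fully blocked sequence yields 1 and the mixed sequence yields 0, matching the interpretation of the scale given above.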
Temporal and semantic factors during final recall. To fur-
ther substantiate this interpretation, measures of temporal and
semantic factors were calculated following the methods proposed
by Sederberg et al. (2010). Temporal factors reflect the degree to
which transitions during free recall output followed the original
temporal order in which words were studied. Semantic factors
reflect the degree to which transitions during recall followed the
semantic relatedness of the words, which was defined as the
similarity scores for each pair of words in the study list based on
Latent Semantic Analysis (Landauer & Dumais, 1997). Briefly,
temporal and semantic factors for each recall protocol were cal-
culated in the following way. For each transition, all possible
transitions were ranked according to temporal proximity or seman-
tic relatedness for temporal or semantic factors, respectively. The
rank of the actual transition relative to all other possible transitions
was determined, and each transition received a score from 0 to 1,
with 1 representing the closest transition and 0 representing the
farthest. The average of the scores represented the temporal or
semantic factor for each protocol (see Sederberg et al., 2010, for
details). Therefore, temporal and semantic factors range from 0 to
1, where factors closer to 1 indicate that subjects transitioned to the
most temporally or semantically proximal words during recall, and
factors closer to 0 indicate that subjects transitioned to the least
temporally or semantically proximal words during recall.
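The ranking procedure just described can be sketched in Python as follows. This is a simplified illustration, not the implementation of Sederberg et al. (2010): the function name is hypothetical, words are identified only by their serial position in the study list, repetitions and intrusions are assumed to have been removed, ties are assumed to count as half, and transitions with a single remaining candidate are skipped. A semantic factor would follow the same logic with similarity scores in place of serial-position distance.

```python
def temporal_factor(recall_positions, list_length):
    """Mean percentile rank of each transition by temporal proximity.

    recall_positions: 1-based study positions of recalled words in output
    order (assumed: no repetitions, no intrusions).
    Returns a value from 0 (farthest transitions) to 1 (closest), or None
    if no transition can be scored.
    """
    remaining = set(range(1, list_length + 1))
    scores = []
    for current, nxt in zip(recall_positions, recall_positions[1:]):
        remaining.discard(current)  # already-recalled words are not candidates
        if len(remaining) < 2:
            continue  # only one possible transition; rank is undefined
        actual = abs(nxt - current)
        # distances of all other words that could have been recalled next
        others = [abs(p - current) for p in remaining if p != nxt]
        farther = sum(lag > actual for lag in others)
        tied = sum(lag == actual for lag in others)
        scores.append((farther + 0.5 * tied) / len(others))
    return sum(scores) / len(scores) if scores else None


forward = temporal_factor([1, 2, 3, 4], list_length=4)   # always the nearest word
jumping = temporal_factor([1, 4, 2, 3], list_length=4)   # always the farthest word
```

In this toy case strictly sequential output scores 1 and output that always jumps to the most distant available word scores 0.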
Subjects in the list discrimination condition showed larger temporal
factors than did subjects in the restudy condition (.71 vs.
.61), t(58) = 3.64, d = 0.94 [0.40, 1.47], consistent with the
temporal clustering analysis carried out with ARC scores. In
contrast, there was essentially no difference in the semantic factors
in the list discrimination and restudy conditions (.54 vs. .55),
t(58) = 0.18, d = 0.05 [−0.46, 0.55].
Foraging patterns during final recall. The final analysis
examined the dynamics of how people searched memory during
final recall, based on the idea that people search memory in ways
that are similar to how animals forage in physical environments
(Hills et al., 2015). Specifically, people search memory by visiting
sets of items, referred to here as “patches,” and spend time recov-
ering items from one patch before switching and searching a
different patch. The analyses of temporal clustering and temporal
factor suggested that retrieval practice produced memory struc-
tures that were organized around temporally defined patches
(study blocks 1, 2, and 3). The foraging analysis explored this
further by examining transitions to and from each temporal patch
during free recall. The onset of a temporal patch visit occurred
when a subject recalled an item from a study block that differed
from the study block of the item recalled immediately before it,
and the end of a patch visit was defined as the onset of recall from
another patch. Subjects with well-defined structures, created by
practicing retrieval during learning, may engage in more efficient
searches than do subjects with memory structures that are not as
well defined. In particular, they may visit temporal patches fewer
times, recover more items per visit, and spend more time searching
per visit.
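The patch definitions above can be made concrete with a short Python sketch. It segments one recall protocol into temporal patch visits and reports the items recovered and the time spent per visit. The function name, the example data, and two boundary conventions are assumptions: the first recalled word is taken to open the first visit, and the last visit is taken to end when the recall period ends (`period_end`), neither of which is specified in the text.

```python
def patch_visits(blocks, times, period_end):
    """Segment a recall protocol into patch visits.

    blocks: study block of each recalled word, in output order.
    times: recall time (in seconds) of each word, same order.
    period_end: assumed end of the recall period, in seconds.
    Returns a list of dicts with keys block, items, start, end.
    """
    visits = []
    for block, t in zip(blocks, times):
        if visits and visits[-1]["block"] == block:
            visits[-1]["items"] += 1          # still in the same patch
        else:
            if visits:
                visits[-1]["end"] = t         # previous visit ends at this onset
            visits.append({"block": block, "items": 1,
                           "start": t, "end": period_end})
    return visits


# Hypothetical protocol: seven words recalled within a 60-s period.
visits = patch_visits(blocks=[1, 1, 2, 2, 2, 1, 3],
                      times=[2, 5, 9, 12, 20, 31, 40],
                      period_end=60)
n_visits = len(visits)
items_per_visit = sum(v["items"] for v in visits) / n_visits
durations = [v["end"] - v["start"] for v in visits]
```

For this protocol there are four visits (blocks 1, 2, 1, 3), so a return to a previously searched block counts as a new visit.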
Overall, the mean number of patch visits did not differ between
the list discrimination and restudy conditions (7.30 vs. 7.37),
t(58) = 0.09, d = 0.02 [−0.48, 0.53]. However, subjects recovered
more items per visit in the list discrimination condition than they
did in the restudy condition (2.61 vs. 1.95), t(58) = 2.53, d = 0.65
[0.13, 1.17]. Subjects also spent more time searching during each
patch visit in the list discrimination condition than they did in the
restudy condition (26.5 s vs. 18.8 s), t(58) = 2.28, d = 0.59 [0.07,
1.10]. Table 3 shows the mean number of items recovered as a
function of visit number, and Table 4 shows the mean time spent
searching as a function of visit number. These data illustrate that
differences in number of items recovered and search times in the
list discrimination and restudy conditions were pronounced during
the first few visits, early in the recall period, and became less
pronounced during later recall.
Table 2
Fates of Individual Items in the List Discrimination Conditions:
Joint Probabilities Between Initial List Discrimination
Performance and Final Free Recall
Experiment C1C2 C1N2 N1C2 N1N2
Experiment 1 .41 (.03) .44 (.03) .05 (.03) .10 (.03)
Experiment 2 .38 (.03) .47 (.02) .04 (.01) .10 (.01)
Experiment 3 .45 (.03) .36 (.02) .08 (.01) .10 (.01)
Note. Standard errors are in parentheses. C1 = correct on the initial list
discrimination task; N1 = not correct on the initial list discrimination task;
C2 = items successfully recalled on the final free recall test; N2 = items
not recalled on the final free recall test.
Figure 1. Proportion correct on final free recall and temporal clustering
scores in Experiment 1. Error bars represent standard errors of the mean.
Discussion
Experiment 1 provided evidence consistent with the episodic
context account of retrieval practice. Subjects were required to
make a list discrimination judgment as a retrieval practice activity.
The task required subjects to think back to the study episode and
determine when each item had occurred in the study phase. All
items were re-presented to subjects in both conditions for the same
amount of time; the only difference between conditions was
whether subjects made a judgment about the previous occurrence
of the items. Experiment 1 showed that the act of making list
discrimination judgments produced a retrieval practice effect, en-
hancing subsequent recall relative to restudying. In addition, clus-
tering analyses indicated that subjects in the list discrimination
condition used the original study order as a strategy to guide recall
output, which further supports the context account of retrieval
practice. Experiment 2 was aimed at expanding upon these find-
ings.
Experiment 2
The goals of Experiment 2 were to replicate the main findings
from Experiment 1 and to compare the effects of making list
discrimination judgments to the effects of making semantic judg-
ments. The procedure was identical to the one used in Experiment
1 with the addition of a pleasantness rating condition. Rating the
pleasantness of words is a widely used semantic encoding task
that, unlike list discrimination, does not require subjects to engage
in episodic remembering. The effects of the three learning condi-
tions were assessed on a final free recall test and with analyses of
the organization of recall around episodic and semantic dimen-
sions.
Method
Subjects. One hundred twenty Purdue University undergrad-
uates participated in Experiment 2 in exchange for course credit.
None of the subjects had participated in Experiment 1. The number
of subjects in Experiment 2 was larger than the number in Exper-
iment 1 to improve power and the precision of effect size esti-
mates.
Materials. A new set of 36 medium frequency, medium con-
creteness words was selected from the Clark and Paivio (2004)
norms. As in Experiment 1, the words were divided into six lists of
six words and then paired to form three study blocks within the
learning phase. The words within each study block were equated
for concreteness, imagery, word frequency, and pleasantness as
determined by the ratings reported in Clark and Paivio (2004). The
Table 3
Mean Number of Items Recovered as a Function of Visit for All Conditions in All Experiments
Experiment Visit 1 Visit 2 Visit 3 Visit 4
Experiment 1
List discrimination 3.30 (.46) 2.97 (.37) 2.24 (.28) 2.25 (.29)
Restudy 2.50 (.40) 2.47 (.30) 2.03 (.25) 1.66 (.17)
Experiment 2
List discrimination 4.13 (.37) 2.28 (.25) 2.31 (.31) 1.94 (.24)
Restudy 2.33 (.28) 2.15 (.19) 2.54 (.28) 1.81 (.17)
Pleasantness 2.18 (.27) 1.93 (.21) 1.90 (.19) 1.77 (.15)
Experiment 3
List discrimination 3.48 (.49) 1.93 (.23) 2.13 (.25) 2.00 (.22)
Restudy 2.00 (.20) 1.80 (.26) 1.58 (.14) 1.56 (.15)
Category judgment 1.90 (.17) 1.53 (.13) 1.56 (.14) 1.46 (.12)
Note. The results are only reported up to the fourth visit because not all subjects had responses for five or more
visits. Standard errors are in parentheses.
Table 4
Mean Time (in Seconds) Spent Within Each Patch as a Function of Visit for All Conditions in
All Experiments
Experiment Visit 1 Visit 2 Visit 3 Visit 4
Experiment 1
List discrimination 10.7 (1.9) 12.6 (3.8) 14.2 (4.7) 24.2 (8.3)
Restudy 6.5 (.9) 6.1 (1.1) 7.1 (2.0) 5.7 (.9)
Experiment 2
List discrimination 12.4 (1.8) 10.6 (2.7) 19.9 (4.2) 27.9 (7.5)
Restudy 6.5 (.9) 7.7 (1.6) 29.5 (9.5) 13.96 (2.8)
Pleasantness 6.8 (.9) 5.6 (.9) 18.0 (6.9) 15.6 (3.8)
Experiment 3
List discrimination 10.7 (2.1) 4.6 (.9) 8.3 (2.4) 9.3 (2.6)
Restudy 5.6 (.8) 4.2 (.7) 5.1 (1.2) 5.6 (1.1)
Category judgment 5.2 (.6) 5.1 (1.3) 4.1 (.5) 4.3 (1.0)
Note. The results are only reported up to the fourth visit because not all subjects had responses for five or more
visits. Standard errors are in parentheses.
order of the study blocks was counterbalanced across subjects.
Pleasantness was equated such that each list pair had the same
number of words from each normed pleasantness rating (e.g., two
words with a normative pleasantness rating of 1, two with a rating
of 2, etc.).
Design. Experiment 2 used a between-subjects design. There
were three conditions: list discrimination, restudy, and pleasant-
ness. Forty subjects were assigned to each condition.
Procedure. The procedure was identical to the one used in
Experiment 1, with the addition of the pleasantness condition. The
procedure involved three phases: Subjects studied a list of words,
then restudied or made judgments about the words, and then took
a final free recall test. The procedures used in the restudy and list
discrimination conditions were identical to those used in Experi-
ment 1. In the pleasantness condition, when subjects were reex-
posed to the list of words, they rated the pleasantness of each word
on a scale from 1 (very pleasant) to 7 (very unpleasant) by clicking
one of the seven corresponding radio buttons displayed below the
word. The words remained on the screen for 3 s regardless of when
subjects made their responses, and the computer program auto-
matically advanced to the next word after 3 s even if a response
had not been made. In all conditions, subjects reexperienced the
words for the same amount of time; the difference was whether
subjects restudied the words, rated the pleasantness of the words,
or made a list discrimination decision about the words by thinking
back to the prior study episode.
Results
List discrimination performance. Subjects entered responses
on 96% of trials (in total, there were 1440 trials [40 subjects × 36
trials per subject], and 1389 responses were recorded). The mean
proportion correct on the list discrimination task was .85, and the
mean response time for correct responses was 1.7 s. As shown in
Table 1, there was little change in list discrimination performance
across study blocks, F(2, 78) = 1.21, η² = 0.08, and, contrary to
the results of Experiment 1, response times did not differ much
across study blocks, F(2, 78) = 0.09, η² = 0.00.
Pleasantness rating performance. In the pleasantness condi-
tion, subjects entered responses on 95% of trials (1370 responses
out of a total of 1440 trials). The mean response time was 2.0 s.
Final free recall. The left panel of Figure 2 shows the pro-
portion of words recalled on the final free recall test. As in
Experiment 1, subjects in the list discrimination condition recalled
more words than did subjects in the restudy condition (.43 vs. .31),
t(78) = 3.27, d = 0.73 [0.28, 1.18]. Subjects in the pleasantness
condition also outperformed subjects in the restudy condition (.41
vs. .31), t(78) = 3.51, d = 0.78 [0.33, 1.23]. There was little
difference in recall between the list discrimination and pleasantness
conditions, t(78) = 0.32, d = 0.07 [−0.37, 0.51]. Making list
discrimination and pleasantness judgments enhanced final recall
relative to restudying the words.
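For readers who want to relate the reported t statistics to the effect sizes, the sketch below converts an independent-samples t value to Cohen's d and attaches an approximate interval. The report does not state how d or its interval was computed, so the formulas here are assumptions: d = t·sqrt(1/n1 + 1/n2), a common large-sample standard error for d, and a 95% normal-approximation interval. The function names are hypothetical; the group sizes of 40 come from the Design section.

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Cohen's d for two independent groups, recovered from the t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate CI for d using a large-sample standard error (assumed method)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# List discrimination vs. restudy, final free recall in Experiment 2: t(78) = 3.27
d = cohens_d_from_t(3.27, 40, 40)
low, high = d_confidence_interval(d, 40, 40)
```

With these inputs the sketch gives roughly d = 0.73 with an interval of about 0.28 to 1.18, in line with the values reported for this comparison.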
The middle row in Table 2 shows the relationship between
initial list discrimination performance and final free recall in
Experiment 2. When items were not correctly identified on the list
discrimination test, it was unlikely that those items were recalled
on the final recall test (the joint probability was .04). When items
were correctly identified on the list discrimination test, they were
much more likely to be recalled on the final recall test (.38).
Temporal clustering during final recall. The right panel of
Figure 2 shows temporal clustering scores, which were ARC
scores that assessed the extent to which recall was organized
around study block. Temporal clustering scores were higher in the
list discrimination condition than they were in the restudy condition
(.38 vs. .24), t(78) = 2.16, d = 0.48 [0.04, 0.93], and in the
pleasantness condition (.38 vs. .16), t(78) = 3.89, d = 0.87 [0.41,
1.33]; for pleasantness versus restudy, t(78) = 1.45, d = 0.32
[−0.12, 0.76]. When subjects made list discrimination judgments
during the learning phase, they subsequently used temporal context
information to guide free recall, consistent with the episodic con-
text account.
An additional analysis examined the extent to which recall was
organized around pleasantness ratings. Pleasantness clustering
scores were calculated as ARC scores with normative pleasantness
ratings (from Clark & Paivio, 2004) as the organizing dimension.
Figure 2. Proportion correct on final free recall and temporal clustering scores in Experiment 2. Error bars
represent standard errors of the mean.
The highest pleasantness clustering scores were observed in the
restudy condition and were similar to those in the pleasantness
rating condition (.17 vs. .14), t(78) = 0.49, d = 0.11 [−0.33, 0.55].
Pleasantness clustering scores were slightly lower in the list
discrimination condition than they were in the restudy condition (.07
vs. .17), t(78) = 1.67, d = 0.37 [−0.07, 0.81], and in the
pleasantness rating condition (.07 vs. .14), t(78) = 1.33, d = 0.30
[−0.14, 0.74]. In general, however, pleasantness clustering scores
were similar across all conditions. Thus, normative pleasantness
did not produce large influences on the organization of final recall.
Temporal and semantic factors during final recall. The
analyses of temporal and semantic factors during final recall
provided further evidence that subjects in the list discrimination and
pleasantness rating conditions used different strategies during the
final recall task. Subjects in the list discrimination condition had
higher temporal factors relative to subjects in the restudy condition
(.67 vs. .59), t(78) = 3.16, d = 0.71 [0.25, 1.16], and subjects in the
pleasantness condition (.67 vs. .55), t(78) = 6.03, d = 1.35 [0.86,
1.83]. Temporal factors were slightly higher in the restudy condition
than they were in the pleasantness condition, t(78) = 1.69,
d = 0.38 [−0.07, 0.82]. In contrast, semantic factors were similar
across conditions. As in Experiment 1, semantic factors were
similar in the list discrimination and restudy conditions (.51 vs.
.53), t(78) = 0.51, d = 0.11 [−0.32, 0.55]. Likewise, the factors
were similar in the pleasantness and restudy conditions (.50 vs.
.53), t(78) = 1.24, d = 0.28 [−0.16, 0.72]; for list discrimination
versus pleasantness, t(78) = 0.90, d = 0.20 [−0.24, 0.64].
Foraging patterns during final recall. The mean number of
temporal patch visits was greater in the list discrimination condition
than in the restudy condition (6.70 vs. 5.78), t(78) = 1.62, d =
0.36 [−0.08, 0.80], and the number of items recovered per visit
was greater in the list discrimination condition than in the restudy
condition (2.57 vs. 2.15), t(78) = 1.74, d = 0.39 [−0.05, 0.83].
Search times during each patch visit were slightly longer in the list
discrimination condition than in the restudy condition (24.6 s vs.
22.7 s), t(78) = 0.58, d = 0.13 [−0.31, 0.57]. In the pleasantness
condition, the number of patch visits was greater than it was in the
list discrimination condition (8.90 vs. 6.70), t(78) = 3.61, d = 0.81
[0.50, 1.26], and in the restudy condition (8.90 vs. 5.78), t(78) =
5.12, d = 1.14 [0.67, 1.62]. Subjects in the pleasantness condition
also recovered fewer items per visit relative to subjects in the list
discrimination condition (1.79 vs. 2.57), t(78) = 3.58, d = 0.80
[0.34, 1.25], and subjects in the restudy condition (1.79 vs. 2.15),
t(78) = 2.17, d = 0.62 [0.17, 1.07]. Finally, search times during
each patch visit were slightly shorter in the pleasantness condition
than they were in the list discrimination condition (21.4 s vs. 24.6
s), t(78) = 1.10, d = 0.25 [−0.19, 0.69]. Search times were similar
in the pleasantness and restudy conditions (21.4 s vs. 22.7 s),
t(78) = 0.39, d = 0.09 [−0.35, 0.53]. Tables 3 and 4 show that
the largest differences in number of items recovered and search
times, respectively, occurred during the first few visits, early in the
recall period.
Discussion
Experiment 2 replicated the key findings from Experiment 1.
Making list discrimination judgments enhanced final recall and
increased the temporal organization of recall relative to restudying.
Making semantic judgments (pleasantness ratings) also enhanced
subsequent recall but did not increase the degree of temporal
organization in final recall, as evidenced by the analyses of tem-
poral clustering, temporal factors, and foraging patterns during
final recall. The results provide further support for the idea that
reinstating the episodic context during initial learning improved
subsequent recall and promoted greater temporal organization on
the final test.
Experiment 3
Experiment 3 provided an additional examination of the effects
of retrieval practice on temporal and semantic organizational fac-
tors in recall. The procedure in Experiment 3 followed the proce-
dure used in the previous experiments, except that subjects studied
categorized lists of words rather than unrelated lists. The experi-
ment involved three conditions. In addition to restudy and list
discrimination conditions, which were identical to those used in
previous experiments, Experiment 3 included a category judgment
condition in which subjects identified the taxonomic categories of
the words. The category judgment task oriented subjects to seman-
tic attributes of the words and, like the pleasantness rating task in
Experiment 2, did not require subjects to think back to the study
episode. Whereas pleasantness ratings are thought to promote
retention by emphasizing the distinctiveness of items, category
judgments require subjects to process how items are related to an
organizational scheme. The analyses conducted in the previous
two experiments were also conducted in Experiment 3 with the
addition of analyses of clustering around semantic categories dur-
ing free recall (traditional ARC scores) and memory foraging
patterns based on semantic categories.
Method
Subjects. One hundred twenty Purdue University undergrad-
uates participated in this experiment in exchange for course credit.
None of the subjects had participated in Experiments 1 or 2.
Materials. Thirty-six words were selected from the Van Over-
schelde, Rawson, and Dunlosky (2004) norms. The most frequent
six exemplars were selected from six taxonomic categories (ani-
mals, fruits, body parts, clothing, instruments, and insects). As in
the previous experiments, the words were assigned to six lists of
six words. One word from each category was assigned to each list.
Design. Experiment 3 used a between-subjects design. There
were three conditions: list discrimination, restudy, and category
judgment. Forty subjects were assigned to each condition.
Procedure. The procedure was identical to the one used in
Experiment 1, with the addition of the category judgment condi-
tion. Subjects studied a list of words, then restudied or made
judgments about the words, and then took a final free recall test.
The restudy and list discrimination conditions were identical to
those in the previous experiments. In the category judgment con-
dition, subjects saw each word and two category alternatives (e.g.,
for the word banana, subjects might see fruits and animals as
alternatives). Subjects indicated which category the word belonged
to by clicking a button associated with the alternative. The words
remained on the screen for 3 s regardless of when subjects made
their responses, so that subjects in all conditions reexperienced the
words for the same duration.
Results
List discrimination performance. Subjects entered re-
sponses on 97% of trials (1396 responses on 1440 trials). The
mean proportion correct on the list discrimination task was .81,
and the average response time for correct responses was 1.7 s. As
shown in Table 1, there were differences in list discrimination
performance across blocks, F(2, 78) = 5.00, η² = 0.11, and
response times tended to become faster across blocks, F(2, 78) =
3.31, η² = 0.08.
Category judgment performance. In the category judgment
condition, subjects entered responses on 98% of trials (1416 re-
sponses on 1440 trials). The mean proportion correct was .99, and
the mean response time for correct responses was 1.7 s.
Final free recall. Figure 3 shows the proportion of words
recalled on the final free recall test. As in Experiments 1 and 2,
subjects in the list discrimination condition recalled more
words than subjects in the restudy condition (.55 vs. .49), t(78) =
2.12, d = 0.47 [0.02, 0.92]. Subjects in the category judgment
condition slightly outperformed subjects in the restudy condition
(.53 vs. .49), t(78) = 1.41, d = 0.25 [−0.19,
0.69]. There was little difference between the list discrimination
and category judgment conditions, t(78) = 0.54, d = 0.12 [−0.32,
0.56].
The bottom row in Table 2 shows the relationship between
initial list discrimination performance and final free recall in
Experiment 3. When items were not correctly identified on the list
discrimination test, those items were not likely to be recalled on
the final recall test (the joint probability was .08). When items
were correctly identified on the list discrimination test, they were
much more likely to be recalled on the final recall test (.45).
Temporal and semantic clustering during final recall. The
middle panel of Figure 3 shows temporal clustering scores, calcu-
lated as they were in previous experiments. Temporal clustering
scores were higher in the list discrimination condition than they
were in the restudy condition (.25 vs. .15), t(78) = 1.84, d = 0.41
[−0.03, 0.85], and in the category judgment condition (.25 vs.
.08), t(78) = 3.71, d = 0.83 [0.37, 1.28]; for category judgment
versus restudy, t(78) = 1.56, d = 0.35 [−0.09, 0.79]. The right
panel of Figure 3 shows semantic clustering scores, which were
ARC scores with taxonomic category as the organizing dimension.
The category judgment task produced the highest semantic clustering
scores, higher than scores in the list discrimination condition
(.41 vs. .21), t(78) = 3.67, d = 0.82 [0.36, 1.27], and slightly
higher than those in the restudy condition (.41 vs. .34), t(78) =
1.09, d = 0.24 [−0.20, 0.68]; for list discrimination versus
restudy, t(78) = 2.33, d = 0.52 [0.07, 0.97]. Thus, the pattern of
semantic clustering scores was the opposite of the pattern of
temporal clustering scores.
Temporal and semantic factors during final recall. The list
discrimination condition produced higher temporal factors relative
to the restudy condition (.60 vs. .55), t(78) = 2.09, d = 0.47 [0.02,
0.91], and the category judgment condition (.60 vs. .53), t(78) =
2.63, d = 0.59 [0.14, 1.03]. There was little difference between the
temporal factors in the restudy and category judgment conditions,
t(78) = 0.07, d = 0.02 [−0.42, 0.45]. However, the semantic
factors showed a different pattern of results. Semantic factors were
slightly higher in the category judgment condition relative to the
restudy condition (.64 vs. .61), t(78) = 1.34, d = 0.30 [−0.14,
0.74], and the list discrimination condition (.64 vs. .58), t(78) =
2.12, d = 0.47 [0.03, 0.92]; for list discrimination versus restudy,
t(78) = 1.52, d = 0.34 [−0.10, 0.78]. Overall, the patterns of
temporal and semantic factors across conditions matched the pat-
terns of temporal and semantic clustering in final recall.
Foraging patterns during final recall. In Experiment 3,
search patterns during final recall could have relied on temporal
patches (study blocks 1, 2, and 3) or semantic patches (taxonomic
categories). Thus, both possible ways of searching memory were
analyzed.
When examining foraging based on a temporal search strategy,
the mean number of temporal patch visits was numerically smaller
for list discrimination relative to restudy (10.10 vs. 11.20), t(78) =
1.14, d = 0.25 [−0.18, 0.69], and relative to category judgment
(10.10 vs. 12.30), t(78) = 2.56, d = 0.57 [0.12, 1.02]; for restudy
versus category, t(78) = 1.07, d = 0.24 [−0.20, 0.68]. Also, the
mean number of items recovered per visit was greater for list
discrimination compared to restudy (2.07 vs. 1.66), t(78) = 2.78,
Figure 3. Proportion correct on final free recall, temporal clustering scores, and semantic (category) clustering
scores in Experiment 3. Error bars represent standard errors of the mean.
d = 0.62 [0.17, 1.07], and category judgment (2.07 vs. 1.59),
t(78) = 3.45, d = 0.77 [0.31, 1.22]; for restudy versus category,
t(78) = 0.99, d = 0.22 [−0.21, 0.66]. Subjects spent more time per
visit in list discrimination compared to restudy (18.3 s vs. 14.3 s),
t(78) = 2.20, d = 0.49 [0.05, 0.94], and category judgment (18.3 s
vs. 14.9 s), t(78) = 1.95, d = 0.44 [0.00, 0.88]; for restudy versus
category, t(78) = 0.39, d = 0.09 [−0.35, 0.68]. As in the previous
experiments, Tables 3 and 4 show that in Experiment 3, the largest
differences in number of items recovered and search times, respec-
tively, occurred during the first few visits, early in the recall
period.
The foraging analysis with semantic category as “patch” was
conducted the same way as the analysis of temporal patches except
the patches were the taxonomic categories used in the experiment
(i.e., fruit, clothing, animals, instruments, body parts, and insects).
The mean number of semantic patch visits was greater in the list
discrimination condition compared to restudy (14.63 vs. 12.08),
t(78) = 2.30, d = 0.51 [0.07, 0.96], and category judgment (14.63
vs. 11.9), t(78) = 2.69, d = 0.60 [0.15, 1.05]; for restudy versus
category, t(78) = 0.17, d = 0.04 [−0.40, 0.48]. However, the
mean number of items recovered per visit was smaller for list
discrimination compared to restudy (1.38 vs. 1.53), t(78) = 1.74,
d = 0.39 [−0.05, 0.83], and category judgment (1.38 vs. 1.67),
t(78) = 3.07, d = 0.69 [0.23, 1.14]; for restudy versus category,
t(78) = 1.32, d = 0.30 [−0.14, 0.74]. Subjects spent slightly more
time per visit in the list discrimination condition relative to restudy
(15.5 s vs. 13.7 s), t(78) = 1.23, d = 0.28 [−0.17, 0.71], and relative
to category judgment (15.5 s vs. 15.14 s), t(78) = 0.20, d = 0.04
[−0.39, 0.48]; for restudy versus category, t(78) = 0.89, d = 0.19
[−0.24, 0.64].
Discussion
Experiment 3 replicated the key findings from the previous
experiments. Making list discrimination judgments led to en-
hanced recall and temporally organized output relative to restudy.
However, this experiment was also able to examine semantic
organization in recall and found that making list discrimination
judgments led to a temporal output strategy, but sorting
words into categories (semantic judgments) led to a semantically
based output strategy. Further, analyses of search patterns replicated
the previous experiments, but also showed that subjects in the category
judgment and restudy conditions searched memory based on
semantically defined patches of information, while subjects in the
list discrimination condition searched based on temporally defined patches.
This dissociation in how recall was organized further supports the
episodic context account that reinstatement of temporal context
allows subjects to use temporal information to guide output on the
criterial test.
General Discussion
The purpose of this project was to evaluate the core assumptions
of the episodic context account of retrieval-based learning. The
account proposes that when people engage in retrieval, they at-
tempt to reinstate the context of a prior learning episode. When
retrieval is successful, the context representation associated with
retrieved items is updated to include features of the retrieved
context and features of the present context. Consequently, when
people attempt to retrieve items again in the future, the updated
context representations facilitate retrieval of those items, and
memory performance is improved relative to situations in which
people had not practiced retrieval.
The present experiments examined predictions that follow from the
episodic context account. One prediction was that when subjects
experience an item, thinking back to a prior occurrence of that item
should enhance subsequent retention relative to conditions in which,
with all else held constant, people do not think back to a prior
occurrence. The present experiments manipulated the retrieval of
occurrence information with a list discrimination task, which required
subjects to make explicit judgments about when items had occurred in
a previous study episode. In all three experiments, initial list discrim-
ination enhanced final recall relative to restudying items under inten-
tional learning instructions. It is important to emphasize that subjects
reexperienced the entire list in both conditions. The only difference
between conditions was that subjects in the list discrimination condi-
tion were asked to think back to the prior occurrence of the words
while subjects in the restudy condition were not. To assess the overall
results across the three experiments in this report, overall effect sizes
comparing the list discrimination condition to the restudy condition
were calculated using weighted effect sizes and a fixed effect meta-
analysis model. The overall effect of retrieval practice in the list
discrimination condition relative to restudying was d = 0.61 [0.34,
0.88].
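The pooling step described above can be illustrated with a minimal sketch of an inverse-variance, fixed-effect combination of standardized mean differences. The function names, the large-sample variance approximation for d, and the example inputs are assumptions made for illustration; they are not the per-experiment values or the exact procedure used in the report.

```python
import math

def d_variance(d, n1, n2):
    """Common large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def fixed_effect(studies):
    """Pool (d, n1, n2) tuples with inverse-variance weights.

    Returns the pooled d and an approximate 95% confidence interval.
    """
    weights = [1.0 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    total = sum(weights)
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / total
    se = math.sqrt(1.0 / total)
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

# Hypothetical inputs for illustration only (NOT the experiments' data).
example = [(0.50, 40, 40), (0.70, 36, 36), (0.60, 30, 30)]
pooled, ci = fixed_effect(example)
print(round(pooled, 2), tuple(round(x, 2) for x in ci))
```

Under a fixed-effect model every experiment is assumed to estimate the same underlying effect, so larger experiments simply receive more weight.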
A second prediction derived from the episodic context account
was that initial retrieval practice would enhance the degree to
which final recall was organized around the original temporal
order of events. Patterns of organization during final recall were
assessed in several converging ways. Relative to the restudy con-
trol condition, retrieval practice in the list discrimination condition
enhanced the degree to which items were clustered around the
original study order (the overall effect was d = 0.45 [0.18, 0.72]).
Measures of temporal and semantic factors (Sederberg et al., 2010)
assessed the extent to which item-to-item transitions during free
recall followed the original temporal order of words or the seman-
tic relatedness of words. Retrieval practice enhanced temporal
factors during final recall, d = 0.68 [0.41, 0.95], but there was
little effect on semantic factors, d = 0.18 [−0.09, 0.44]. Finally, a
foraging analysis (Hills et al., 2012, 2015) examined the dynamics
of how people searched memory during final recall. Comparing the
list discrimination and restudy conditions, there was no difference
in the number of times subjects visited temporally defined patches
during memory search, d = 0.04 [−0.22, 0.31], but subjects in the
list discrimination condition recovered more items per visit, d =
0.54 [0.27, 0.81], and spent more time searching per visit, d = 0.38
[0.12, 0.65], than did subjects in the restudy condition. Practicing
retrieval had clear and consistent effects on search strategies
during the final recall test.
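As a rough illustration of how a temporal factor score can be computed from a recall sequence, the sketch below ranks each observed transition against the transitions that were still possible at that point. It is a simplified version written in the spirit of Sederberg et al. (2010); the function name, the tie rule, and the assumption that the output contains no repetitions or intrusions are choices made here, not details taken from the report.

```python
def temporal_factor(recall_positions, list_length):
    """Mean percentile rank of each transition's absolute lag.

    recall_positions: study positions (1-based) in output order, no repeats.
    A score of 1.0 means every transition went to the nearest available
    item; 0.5 is the level expected from random transitions.
    """
    scores = []
    recalled = set()
    for current, nxt in zip(recall_positions, recall_positions[1:]):
        recalled.add(current)
        # Items that could still be recalled at this point.
        candidates = [p for p in range(1, list_length + 1) if p not in recalled]
        if len(candidates) < 2:
            continue  # no alternative transition, so the rank is undefined
        actual = abs(nxt - current)
        others = [abs(p - current) for p in candidates if p != nxt]
        farther = sum(lag > actual for lag in others)
        ties = sum(lag == actual for lag in others)
        scores.append((farther + 0.5 * ties) / len(others))
    return sum(scores) / len(scores) if scores else None

print(temporal_factor([1, 2, 3, 4], 4))  # 1.0, strictly forward recall
print(temporal_factor([1, 4, 2, 3], 4))  # 0.0, always the farthest item
```

A semantic factor can be scored the same way by replacing absolute lag with a semantic distance between words.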
Practicing retrieval in the list discrimination condition produced
patterns of final recall that differed from those produced by elab-
orative study tasks, including rating the pleasantness of words
(Experiment 2) and judging category membership (Experiment 3).
Both elaborative study tasks enhanced retention relative to re-
studying the words, which was no surprise, because elaborative
encoding has been shown to enhance retention in decades of
research. However, whereas retrieval practice enhanced temporal
organization during final recall—as assessed with temporal ARC
scores, temporal factors, and foraging analyses— elaborative en-
coding tasks did not. For instance, the pleasantness and category
judgment tasks resulted in the least amount of temporal clustering
in Experiments 2 and 3 (see Figures 2 and 3), even less temporal
clustering than that in the restudy control condition. Whereas final
recall in the retrieval practice condition was clearly organized
around temporal dimensions, recall in the elaborative encoding
conditions tended to be more closely based on semantic factors.
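The clustering comparison above rests on adjusted ratio of clustering (ARC) scores (Roenker et al., 1971). The sketch below shows the standard ARC formula applied to a sequence of group labels; the function name and the example labels are hypothetical. The same function applies to temporal clustering if each word is labeled by a temporally defined group rather than a semantic category, although how the report defined those groups is not restated here.

```python
from collections import Counter

def arc_score(recalled_labels):
    """Adjusted ratio of clustering for group labels listed in output order.

    ARC = (R - E(R)) / (max R - E(R)), where R is the number of adjacent
    same-group pairs, E(R) = sum(n_i^2) / N - 1, and max R = N - k.
    Returns None when the score is undefined.
    """
    n = len(recalled_labels)
    if n == 0:
        return None
    counts = Counter(recalled_labels)
    observed = sum(a == b for a, b in zip(recalled_labels, recalled_labels[1:]))
    expected = sum(c * c for c in counts.values()) / n - 1
    maximum = n - len(counts)
    if maximum == expected:
        return None  # e.g., all recalled items come from one group
    return (observed - expected) / (maximum - expected)

print(arc_score(["animal", "animal", "fruit", "fruit"]))  # 1.0, perfect clustering
print(arc_score(["animal", "fruit", "animal", "fruit"]))  # -1.0, below chance
```

A score of 0 corresponds to chance-level clustering, which is what makes ARC comparable across subjects who recall different numbers of items.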
Previous studies have compared retrieval practice to elaborative
study conditions and reasoned that if retrieval-based learning is
due to elaboration, then elaborative study and retrieval practice
tasks should produce the same final performance (see Karpicke &
Blunt, 2011; Karpicke & Smith, 2012; Lehman et al., 2014). Those
studies showed that retrieval practice and elaboration produce
different final test performance, which casts doubt on the idea that
the same mechanism or strategy was responsible for both effects.
In Experiments 2 and 3 in the present report, there was little
difference between retrieval practice and elaboration conditions
(pleasantness and category sorting) on final free recall. One might
be tempted to conclude that similar final test performance affirms
that retrieval practice effects are due to elaboration. However, this
reasoning would not be valid, because it relies on affirming the
consequent. Two different tasks can produce the same level of
performance via different mechanisms or strategies, and the pres-
ent experiments provide a prime example. The clustering mea-
sures, temporal and semantic factors, and foraging analyses
showed that retrieval practice and elaborative study tasks yielded
very different patterns of final recall organization, suggesting that
the effects were driven by different mechanisms in different con-
ditions.
It is worth considering the present findings in light of alternative
explanations of retrieval practice, such as the elaborative retrieval
account (Carpenter, 2009, 2011). This account proposes that as
people search for target items during the process of retrieval, other
items that are semantically related to the retrieval cue (related
words, or mediators) become activated. This semantic elaboration
assumed to occur during initial retrieval is also thought to be
responsible for enhancing retention on a subsequent test (see
Lehman & Karpicke, 2016). It is not readily apparent how the
elaborative retrieval account might explain the present results.
Making list discrimination judgments is, by definition, an episodic
task, and it is not clear why any activation of semantically related
words would occur when people attempt to judge the list mem-
bership of individual words. Even if list discrimination judgments
did induce such semantic elaboration, it would be hard to reconcile
the elaborative retrieval account with the present analyses of final
recall, which show that retrieval practice produced temporally
organized recall and, in some instances, reduced semantic organi-
zation (e.g., see Figure 3). Retrieval practice reliably enhanced
retention in the present experiments, but that enhancement was
driven by temporal factors, not semantic ones.
The episodic context account of retrieval-based learning shares
some similarities with ideas that have been proposed to explain
spaced repetition effects (see Karpicke et al., 2014). Specifically,
a spaced repetition may enhance retention because the repetition
reminds the learner of a previous occurrence (e.g., Wahlheim &
Jacoby, 2013) or, similarly, because the repetition affords retrieval
of a prior occurrence (an idea known as study-phase retrieval; e.g.,
Benjamin & Tullis, 2010). Wahlheim and Jacoby proposed that
when a person is reminded of a prior occurrence, the representation
of the first presentation is “included” in the representation of the
second presentation. Raaijmakers (2003) implemented the idea of
study-phase retrieval in the SAM model (Raaijmakers & Shiffrin,
1981). In Raaijmakers’s account, when a studied item is repeated,
people may retrieve the trace of the prior presentation and, when
this happens, the context strength associated with that item is
incremented (Raaijmakers’s model incorporates additional as-
sumptions about contextual variability; see too Delaney, Verkoei-
jen, & Spirgel, 2010). As discussed by Karpicke et al. (2014), these
accounts of spacing effects share several features with the episodic
context account of retrieval practice. One difference, however, is
that in studies of spaced repetition, the processes of reminding or
study-phase retrieval are incidental, assumed to occur spontane-
ously, whereas people are explicitly prompted to think back to a
prior occurrence when they practice retrieval. Most importantly,
the ideas of reminding and study-phase retrieval attribute the
benefits of spaced repetition to retrieval practice, which is itself a
phenomenon that needs to be explained. The ideas in the episodic
context account therefore add to reminding and study-phase re-
trieval theories by proposing mechanisms to explain how the
process of retrieval enhances subsequent retention.
The present project tested the core assumptions of the episodic
context account of retrieval-based learning and provided evidence
supporting the account. Thinking back to a prior learning epi-
sode—an essential ingredient of retrieval practice— enhances later
retention and produces fundamental changes in how learners or-
ganize subsequent recall.
References
Benjamin, A. S., & Tullis, J. (2010). What makes distributed practice
effective? Cognitive Psychology, 61, 228–247.
http://dx.doi.org/10.1016/j.cogpsych.2010.05.004
Brewer, G. A., Marsh, R. L., Meeks, J. T., Clark-Foos, A., & Hicks, J. L.
(2010). The effects of free recall testing on subsequent source memory.
Memory, 18, 385–393. http://dx.doi.org/10.1080/09658211003702163
Carpenter, S. K. (2009). Cue strength as a moderator of the testing effect:
The benefits of elaborative retrieval. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 35, 1563–1569.
http://dx.doi.org/10.1037/a0017021
Carpenter, S. K. (2011). Semantic information activated during retrieval
contributes to later retention: Support for the mediator effectiveness
hypothesis of the testing effect. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 37, 1547–1552.
http://dx.doi.org/10.1037/a0024140
Carpenter, S. K., & DeLosh, E. L. (2006). Impoverished cue support
enhances subsequent retention: Support for the elaborative retrieval
explanation of the testing effect. Memory & Cognition, 34, 268–276.
http://dx.doi.org/10.3758/BF03193405
Chan, J. C. K., & McDermott, K. B. (2007). The testing effect in
recognition memory: A dual process account. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 33, 431–437.
http://dx.doi.org/10.1037/0278-7393.33.2.431
Clark, J. M., & Paivio, A. (2004). Extensions of the Paivio, Yuille, and
Madigan (1968) norms. Behavior Research Methods, Instruments, &
Computers, 36, 371–383. http://dx.doi.org/10.3758/BF03195584
Delaney, P. F., Verkoeijen, P. P. J. L., & Spirgel, A. (2010). Spacing and
testing effects: A deeply critical, lengthy, and at times discursive review
of the literature. In B. H. Ross (Ed.), Psychology of learning and
motivation (Vol. 53, pp. 63–147). San Diego, CA: Elsevier Academic
Press. http://dx.doi.org/10.1016/S0079-7421(10)53003-2
Glover, J. A. (1989). The ‘testing’ phenomenon: Not gone but nearly
forgotten. Journal of Educational Psychology, 81, 392–399.
http://dx.doi.org/10.1037/0022-0663.81.3.392
Hills, T. T., Jones, M. N., & Todd, P. M. (2012). Optimal foraging in
semantic memory. Psychological Review, 119, 431–440.
http://dx.doi.org/10.1037/a0027373
Hills, T. T., Todd, P. M., & Jones, M. N. (2015). Foraging in semantic
fields: How we search through memory. Topics in Cognitive Science, 7,
513–534. http://dx.doi.org/10.1111/tops.12151
Howard, M. W., & Kahana, M. J. (2002). A distributed representation of
temporal context. Journal of Mathematical Psychology, 46, 269–299.
http://dx.doi.org/10.1006/jmps.2001.1388
Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more
learning than elaborative studying with concept mapping. Science, 331,
772–775.
Karpicke, J. D., Lehman, M., & Aue, W. R. (2014). Retrieval-based
learning: An episodic context account. In B. H. Ross (Ed.), Psychology
of learning and motivation (Vol. 61, pp. 237–284). San Diego, CA:
Elsevier Academic Press.
Karpicke, J. D., & Smith, M. A. (2012). Separate mnemonic effects of
retrieval practice and elaborative encoding. Journal of Memory and
Language, 67, 17–29.
Karpicke, J. D., & Zaromb, F. M. (2010). Retrieval mode distinguishes the
testing effect from the generation effect. Journal of Memory and
Language, 62, 227–239. http://dx.doi.org/10.1016/j.jml.2009.11.010
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem:
The latent semantic analysis theory of acquisition, induction, and
representation of knowledge. Psychological Review, 104, 211–240.
http://dx.doi.org/10.1037/0033-295X.104.2.211
Lehman, M., & Karpicke, J. D. (2016). Elaborative retrieval: Do semantic
mediators improve memory? Journal of Experimental Psychology:
Learning, Memory, and Cognition, 42, 1573–1591.
http://dx.doi.org/10.1037/xlm0000267
Lehman, M., & Malmberg, K. J. (2013). A buffer model of memory
encoding and temporal correlations in retrieval. Psychological Review,
120, 155–189. http://dx.doi.org/10.1037/a0030851
Lehman, M., Smith, M. A., & Karpicke, J. D. (2014). Toward an episodic
context account of retrieval-based learning: Dissociating retrieval
practice and elaboration. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 40, 1787–1794.
http://dx.doi.org/10.1037/xlm0000012
Murphy, M. D., & Puff, C. R. (1982). Free recall: Basic methodology and
analysis. In C. R. Puff (Ed.), Handbook of research methods in human
memory and cognition (pp. 99–128). San Diego, CA: Academic Press.
http://dx.doi.org/10.1016/B978-0-12-566760-9.50009-9
Nunes, L. D., & Karpicke, J. D. (2015). Retrieval-based learning: Research
at the interface between cognitive science and education. In R. A. Scott
& S. M. Kosslyn (Eds.), Emerging Trends in the Social and Behavioral
Sciences (pp. 1–16). Hoboken, NJ: Wiley.
http://dx.doi.org/10.1002/9781118900772.etrds0289
Pu, X., & Tse, C.-S. (2014). The influence of intentional versus incidental
retrieval practices on the role of recollection in test-enhanced learning.
Cognitive Processing, 15, 55–64.
http://dx.doi.org/10.1007/s10339-013-0580-2
Raaijmakers, J. G. W. (2003). Spacing and repetition effects in human
memory: Application of the SAM model. Cognitive Science, 27,
431–452. http://dx.doi.org/10.1207/s15516709cog2703_5
Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of associative
memory. Psychological Review, 88, 93–134.
http://dx.doi.org/10.1037/0033-295X.88.2.93
Roediger, H. L., & Karpicke, J. D. (2011). Intricacies of spaced retrieval:
A resolution. In A. S. Benjamin (Ed.), Successful remembering and
successful forgetting: Essays in honor of Robert A. Bjork (pp. 23–48).
New York, NY: Psychology Press.
Roenker, D. L., Thompson, C. P., & Brown, S. C. (1971). Comparison of
measures for the estimation of clustering in free recall. Psychological
Bulletin, 76, 45–48. http://dx.doi.org/10.1037/h0031355
Rowland, C. A. (2014). The effect of testing versus restudy on retention: A
meta-analytic review of the testing effect. Psychological Bulletin, 140,
1432–1463. http://dx.doi.org/10.1037/a0037559
Sederberg, P. B., Miller, J. F., Howard, M. W., & Kahana, M. J. (2010).
The temporal contiguity effect predicts episodic memory performance.
Memory & Cognition, 38, 689–699.
http://dx.doi.org/10.3758/MC.38.6.689
Tulving, E. (1964). Intratrial and intertrial retention: Notes towards a
theory of free recall verbal learning. Psychological Review, 71,
219–237. http://dx.doi.org/10.1037/h0043186
Tulving, E. (1983). Elements of episodic memory. New York, NY: Oxford
University Press.
Van Overschelde, J. P., Rawson, K. A., & Dunlosky, J. (2004). Category
norms: An updated and expanded version of the Battig and Montague
(1969) norms. Journal of Memory and Language, 50, 289–335.
http://dx.doi.org/10.1016/j.jml.2003.10.003
Verkoeijen, P. P. J. L., Tabbers, H. K., & Verhage, M. L. (2011).
Comparing the effects of testing and restudying on recollection in
recognition memory. Experimental Psychology, 58, 490–498.
http://dx.doi.org/10.1027/1618-3169/a000117
Wahlheim, C. N., & Jacoby, L. L. (2013). Remembering change: The
critical role of recursive remindings in proactive effects of memory.
Memory & Cognition, 41, 1–15.
http://dx.doi.org/10.3758/s13421-012-0246-9
Received June 6, 2016
Revision received November 14, 2016
Accepted November 16, 2016
Please do not include this Method section in your final research report. Write your report as
if it is present.
Method
Participants
A total of 565 introductory psychology students (190 male, 373 female, 2 non-binary) voluntarily
took part in the current study as part of requirements for their course. Their mean age was 19.87
years (min. 17, max. 54). Participants were tested in groups of no more than 24 and completed
the study as part of their allocated tutorial class.
Materials
Thirty-six words were selected from the Van Overschelde, Rawson and Dunlosky (2004)
norms. The most frequent six exemplars that were between three and eight letters in length were
selected from four taxonomic categories – ‘animals’, ‘fruits’, ‘musical instruments’, and ‘weather
event’. Three words from each category were taken to create three interrelated lists containing 12
words each. Presentation order was random in all conditions. Word and list presentation were
counterbalanced.
Design and procedure
In total, four conditions were compared between subjects: Restudy, Retrieval (Free Recall),
List Discrimination, and Category Judgement. Participants were randomly allocated to one of the
conditions. Figure 1 illustrates the experimental design. In all learning blocks participants were
shown a list of words (Study lists: SL1, SL2, SL3; see Footnote 1), then required to engage in a 40 s
distractor task which required them to solve maths problems. In learning blocks 1 and 2, this was
followed by an interval task (refer to Figure 2) which differed across conditions and consisted of
one of the following:
Retrieval: Participants were asked to try to retrieve as many words as they could from
the list (free recall) and were given 60 s to do so.
Restudy: Participants were shown the list again in exactly the same manner as in the
study phase.
Category judgement: Participants saw each word with two category alternatives which
appeared underneath (e.g., for the word Horse, participants were asked to decide whether it was a
fruit or an animal). Participants indicated which category the word belonged to by clicking a labelled
button.
List discrimination: During the study phase the words had been split into two smaller lists
(lists 1.1 and 1.2, or 2.1 and 2.2), and in this condition the words were shown again and the task
required participants to indicate which list the word had appeared in.
Footnote 1. In the list discrimination condition the first half of each list was labelled “List 1.1” or
“List 2.1” and the second half of each list was labelled “List 1.2” or “List 2.2”, but other than this
minor break, these lists were presented as they were in all other conditions.
Figure 1. The experiment consisted of three Learning Blocks. Blocks 1 and 2 contained the same interval tasks, which were
manipulated between‐subjects. Block 3 was the exact same for each condition and the interval task was always a free recall. List
3 free recall was compared.
In the final learning block (Learning Block 3), all participants in all conditions were
shown a single list of words (List 3). They were then required to engage in the same 40s
distractor task which required them to solve maths problems. However, in Learning Block 3 there
was no interval task, and all participants completed a final task (identical across all conditions)
which consisted of a retrieval (free recall) task of the List 3 words which lasted 60s. Memory
performance on this task provided the key dependent measure for the study.
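The dependent measure can be made concrete with a small scoring sketch: the proportion of List 3 words that appear in a participant's free recall output. The function name, the example words, and the scoring rules (case-insensitive matching, repeated responses counted once, intrusions ignored) are assumptions for illustration; the actual scoring rules used in the study are not stated here.

```python
def proportion_recalled(responses, study_list):
    """Proportion of study-list words present in the recall responses."""
    targets = {word.strip().lower() for word in study_list}
    recalled = {resp.strip().lower() for resp in responses} & targets
    return len(recalled) / len(targets)

# Hypothetical 12-word list and recall protocol, for illustration only.
list3 = ["horse", "apple", "piano", "rain", "tiger", "grape",
         "drum", "snow", "zebra", "peach", "flute", "hail"]
output = ["Horse", "apple", "piano", "rain", "tiger", "grape",
          "table",   # intrusion: not on the list, so not counted
          "horse"]   # repetition: counted once
print(proportion_recalled(output, list3))  # 0.5, i.e., 6 of 12 words
```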
Figure 2. Examples of interval tasks performed in Learning Blocks 1 and 2. During Interval 3 all subjects
completed retrieval (free recall).
Journal of Memory and Language
Testing potentiates new learning across a retention interval and a lag: A
strategy change perspective
Jason C.K. Chan (a,⁎), Krista D. Manley (a), Sara D. Davis (a), Karl K. Szpunar (b)
(a) Iowa State University, USA
(b) University of Illinois at Chicago, USA
ARTICLE INFO
Keywords:
Retrieval practice
New learning
Test-potentiated learning
Forward testing effect
Relational processing
Strategy change
ABSTRACT
Practicing retrieval on previously studied materials can potentiate subsequent learning of new materials. In four
experiments, we investigated the influence of retention interval and lag on this test-potentiated new learning
(TPNL) effect. Participants studied four word lists and either practiced retrieval, restudied, or completed math
problems following Lists 1–3. Memory performance on List 4 provided an estimate of new learning. In
Experiments 1 and 2, participants were tested on List 4 after either a 1 min or 25 min retention interval. In
Experiments 3 and 4, participants took a 25 min break before studying List 4. A TPNL effect was observed in all
experiments. To gain insight into the mechanism that may underlie TPNL, we analyzed the extent to which
participants organized their recall from list to list. Relative to restudy and math, testing led to superior semantic
organization across lists. Our results support a strategy change account of TPNL.
Introduction
A growing body of research has shown that interspersing encoding
with test questions can strengthen student learning. For example, when
viewing a lecture, students who answer quiz questions throughout a
lecture often better remember the tested information than students who
are not quizzed (i.e., the testing effect, McDaniel, Roediger, &
McDermott, 2007). More important for present purposes, however,
students who answer interspersed quiz questions also better learn new
information presented after the quiz than students who are not quizzed
(e.g., Jing, Szpunar, & Schacter, 2016; Szpunar, Khan, & Schacter,
2013). That is, interspersed testing enhances new learning. In this paper,
we refer to this benefit of testing as test-potentiated new learning, or
TPNL.
The TPNL effect is typically investigated using a multi-list or multi-
section learning paradigm. For example, subjects may be asked to
memorize two lists of words. After studying List 1, subjects may take a
test for that list (i.e., the interspersed-testing condition) or not (i.e., the no-
testing condition) before they study List 2. In this example, List 1 re-
presents original learning and List 2 represents new learning, and differ-
ences in performance for List 2 between the interspersed-testing con-
dition and the no-testing condition demonstrate the influence of testing
on new learning. A wealth of research has shown that testing can fa-
cilitate learning of new information (Chan, Meissner, & Davis, in pre-
paration; Pastotter & Bauml, 2014; Yang, Potts, & Shanks, 2018). In
general, the TPNL effect is robust and applicable to a variety of learning
situations. For example, testing can promote new learning of lists of
single words (Szpunar, McDermott, & Roediger, 2008), word pairs
(Tulving & Watkins, 1974; Wahlheim, 2015), picture-word pairs (Davis
& Chan, 2015; Weinstein, McDermott, & Szpunar, 2011), text passages
(Wissman & Rawson, 2015; Wissman, Rawson, & Pyc, 2011), and video
lectures (Szpunar et al., 2013). However, several factors have also been
shown to moderate this effect, including the participants’ perceived
likelihood of being tested (Weinstein, Gilmore, Szpunar, & McDermott,
2014; but see also Wissman et al., 2011) and the frequency with which
participants have to switch between retrieval and new learning (Davis
& Chan, 2015; Davis, Chan, & Wilford, 2017).
Despite the increasingly sizable literature on the phenomenon of
TPNL, we currently know very little about the persistence of this effect
across a time delay. In the present study, we focus on two kinds of time
delay: (1) the retention interval between new learning and its assessment
and (2) the time between presentation of original and new learning,
which we refer to as “lag.” The dearth of research on delay is particu-
larly glaring given the copious amount of evidence for the longevity of
the testing effect (for reviews, see Adesope, Trevisan, & Sundararajan,
2017; Rowland, 2014). We now describe the importance of these two
types of delay for both the application of interspersed testing to edu-
cation and the theoretical understanding of TPNL.
https://doi.org/10.1016/j.jml.2018.05.007
Received 5 October 2017; Received in revised form 12 May 2018; Available online 30 May 2018
⁎ Corresponding author at: Department of Psychology, Iowa State University, Ames, IA 50011, USA.
E-mail address: ckchan@iastate.edu (J.C.K. Chan).
Journal of Memory and Language 102 (2018) 83–96
0749-596X/ Published by Elsevier Inc.
Retention interval
To date, most studies in this literature have assessed the influence of
interspersed testing on new learning at very short (1 min) retention
intervals (Aslan & Bauml, 2015; Chan, Thomas, & Bulevich, 2009; Davis
& Chan, 2015; Szpunar et al., 2008; Tulving & Watkins, 1974;
Wahlheim, 2015; Weinstein et al., 2011; Wissman & Rawson, 2015),
and the effects of retention interval on TPNL have rarely been the focus
of extant investigations. Moreover, studies that have included multiple
retention intervals have produced mixed results. For example, Szpunar
et al. (2008) had participants study five lists of words, and participants
either completed an immediate test or math problems following each of
the first four lists. After studying a fifth list, all participants received a
test for that list. The List 5 test allowed the researchers to examine the
immediate impact of interspersed testing (relative to math) on new
learning (of List 5). Furthermore, all participants took a final recall test
of all studied items (including items from List 5) 30 min later, and the
results of this delayed test showed that the TPNL effect persisted across
the 30 min retention interval (see also Jing et al., 2016; Nunes &
Weinstein, 2012; Pierce, Gallo, & McCain, 2017; Szpunar et al., 2013;
Weinstein et al., 2011 in which the TPNL effect persisted across a re-
tention interval of 5 min or less). In fact, across two experiments, the
TPNL effect was nearly identical regardless of whether one examines
performance in the 1-min (d = 1.52) or 30-min delay test (d = 1.60). In
contrast, in an experiment that employed a similar design, Wissman and
Rawson (2015, Experiment 4) found that the TPNL effect at immediate
testing (d = 1.66) was substantially diminished after just a 15-min
delay (d = 0.78), which suggests that the TPNL effect might be some-
what ephemeral. Indeed, a recent meta-analysis found that studies that
used longer retention intervals tended to produce a smaller TPNL effect
than studies that employed shorter retention intervals (Chan et al., in
preparation). However, as with all moderator analyses that include data
from different studies, the result of this meta-regression is correlational
in nature and must be interpreted with caution. Hence, more research is
needed to evaluate the influence of retention interval on TPNL, parti-
cularly if the benefits of interpolated testing are to be interpreted as
having relevance for educational practice.
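The effect sizes quoted above are standardized mean differences. As a reminder of what such a value expresses, the sketch below computes Cohen's d for two independent groups using the pooled standard deviation. The function name and the scores are hypothetical, and the cited studies may have used a different variant of d.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical recall scores, for illustration only.
tested = [2, 4, 6]
control = [1, 3, 5]
print(cohens_d(tested, control))  # 0.5, a one-point difference with a pooled SD of 2
```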
Perhaps even more important than the mixed results on TPNL and
retention interval is that interpretation of final test performance in
existing studies is not straightforward. For instance, Szpunar et al.
(2008) and Wissman and Rawson (2015) required participants to take
both an immediate test and a final test for the new learning materials.
Therefore, recall performance on the delayed final test was con-
taminated by that of the immediate test, making it difficult to estimate
the true effects of retention interval on TPNL. In Experiments 1 and 2,
we sought to assess the influence of retention interval on TPNL without
this potential source of contamination. Specifically, we administered
only one test for the critical new learning material after a filled reten-
tion interval of either 1 min (to clear short-term memory) or 25 min.
Lag
To our knowledge, no studies to date have examined the influence
of lag (i.e., the delay between original learning and new learning) on
TPNL in a multi-list learning paradigm. The effects of lag on TPNL have
important implications both for the implementation of interspersed
testing in the classroom and the theoretical understanding of TPNL.
Specifically, pedagogical guides often stress the fact that sustaining
attention for long periods of time can be difficult, which can lead to
frequent mind wandering by the learner (Bunce, Flens, & Neiles, 2010;
Risko, Anderson, Sarwal, Engelhardt, & Kingstone, 2012; Szpunar,
2017). Educators possess an intuitive understanding of the fact that
learners often struggle to sustain attention, and they suggest taking
breaks as a strategy for counteracting the negative impact of time-on-
task for learning information presented at the end of long study se-
quences. These breaks can take various forms, including asking learners
questions (i.e., testing), presenting a video, having group discussions, or
simply giving students a bathroom break (Centre for Teaching
Excellence – University of Waterloo, 2012; Olmsted, 1999). Such study
breaks are thought to be effective at helping students refocus and learn
new information, because they allow students to temporarily deactivate
the prolonged task-goal of learning and attend to activities with dif-
ferent task goals (Ariga & Lleras, 2011). Similarly, taking interspersed
tests can enhance new learning by providing a break from the encoding
activities required by a prolonged study sequence. Specifically, forcing
participants to switch the task from encoding to retrieval has been
hypothesized to initiate a context change (Jang & Huber, 2008; Jonker,
Seli, & MacLeod, 2013; Whiffen & Karpicke, 2017), which allows
participants to “reset” their encoding operations (Pastötter, Schicker,
Niedernhuber, & Bäuml, 2011).
An important question is whether testing enhances new learning
because it essentially serves as a study break, or if retrieval is “special”
in its ability to enhance new learning beyond providing a break to
encoding activities. If the former possibility proves correct, then testing
should not facilitate new learning when compared to a condition in
which new learning occurs following a study break. In Experiments 3
and 4, we aimed to examine the influence of lag on TPNL. Specifically,
participants studied four lists of words. After presentation of each of the
first three lists, participants either recalled the list, performed mental
arithmetic, or restudied the list. To examine whether the benefits of
testing on new learning are distinct from those afforded by providing a
study break, we inserted a 25-min filled lag just before participants
studied List 4. During this lag, participants took a break from encoding
by completing a series of brain teasers and then playing the videogame
Tetris (more details about these tasks are described in the Method
section of Experiment 3). These tasks were selected as the lag activities
because they differed substantially from the encoding task.
A strategy change perspective of test-potentiated new learning
In the preceding section, we described one potential mechanism by
which interpolating retrieval can facilitate subsequent learning –
namely, that changing from an encoding context to a retrieval context
may provide a break from the encoding activities (Jang & Huber, 2008).
Research in verbal learning (Gunter, 1980; Wickens, 1970) and
intentional forgetting (Sahakyan & Kelley, 2002) has repeatedly
demonstrated that changing context can release learners from the negative
impact of proactive interference, thereby facilitating new learning.
From this perspective, inserting memory tests and inserting study
breaks into an encoding session may serve similar functions. An alter-
native account, however, posits that interpolated testing enhances new
learning beyond context change — specifically, testing may enhance
new learning by offering an opportunity for participants to switch to
more effective encoding strategies during new learning.
According to this strategy change account, taking a memory test can
lead participants to use different, and perhaps superior, encoding
strategies for later learning (Cho, Neely, Crocco, & Vitrano, 2017;
Gordon & Thomas, 2017), because the test provides participants with
important, performance-relevant information such as test format, the
type of retrieval cues available, the amount of time available for re-
trieval, etc. The idea that performing retrieval can alter how partici-
pants approach subsequent learning has received some empirical sup-
port. For example, learners reported that they were more likely to use
deeper encoding strategies when relearning previously studied mate-
rials after a test trial than after a restudy trial (Soderstrom & Bjork,
2014). Further, taking a test can alter how participants distribute their
encoding or attentional resources during subsequent encoding oppor-
tunities (Chan, Manley, & Lang, 2017; Gordon & Thomas, 2014; Jing
et al., 2016; Szpunar et al., 2013). For example, in a recent study using
a triad learning paradigm (Davis & Chan, 2015; see also Finn &
Roediger, 2013), participants first studied a set of face-name pairs.
Next, participants either restudied or recalled the name associated with
J.C.K. Chan et al. Journal of Memory and Language 102 (2018) 83–96
each face (i.e., original learning) before they studied the profession for
that face (i.e., new learning). Importantly, during this new learning
trial, both the face-name (i.e., original learning) and face-profession
(i.e., new learning) associations were present for study. Surprisingly,
instead of demonstrating the usual TPNL effect, testing impaired
learning of the new, face-profession association. Through several ex-
periments, Davis and Chan (2015) attributed this result to the fact that
attempting to retrieve the face-name association altered how partici-
pants approached the encoding task when the face-profession associa-
tion (along with the face-name association) was presented for encoding.
Specifically, they argued that the face-name association test trial, but
not the restudy trial, revealed to participants the difficulty of learning
the face-name pair. When the new, face-profession association was
presented for study, participants “borrowed time” from the new-
learning trial to restudy the face-name association, thus impairing new
learning. Moreover, recent research has shown that testing can affect
both test expectancy and the amount of time participants spend on
future learning activities. For example, taking a test increases learners’
expectation that they will be tested again in the near future (Weinstein
et al., 2014). Perhaps partly because of this increased test expectancy,
when learners were allowed to self-regulate their study duration, those
who received interpolated tests spent longer studying new information
than those who did not (Gordon & Thomas, 2014; Gordon, Thomas, &
Bulevich, 2015; Yang, Potts, & Shanks, 2017).
Although the results of the above-cited studies are consistent with
the idea that retrieval can cause a strategy change for subsequent en-
coding, they do not provide direct evidence that strategy change un-
derlies TPNL. Specifically, Soderstrom and Bjork’s (2014) results were
based on a relearning, not new-learning, paradigm, and the data re-
garding strategy change were based on participants’ subjective report.
Moreover, Davis and Chan (2015) did not provide direct evidence of
strategy change, because they did not measure the amount of time that
participants devoted to relearning of the face-name association relative
to new learning of the face-profession association. Lastly, although the
findings that prior testing increases test expectancy (Weinstein et al.,
2014) and new-learning duration (Gordon et al., 2015; Yang et al.,
2017) signal a shift in strategy, these findings do not provide evidence
that the strategy change is qualitative in nature. In the present ex-
periments, we attempted to provide a more direct test of this strategy
change account by examining organization in recall on a list-by-list
basis.
Prior work on interpolated testing has focused primarily on the
quantity of the learning that takes place during new learning trials (e.g.,
how many words are correctly recalled), and little is known about the
quality of the learning. To the extent that the type of strategy that
participants use to encode items is reflected in the way they recall these
items, we can assess their strategy use by examining how they organize
their recall. Indeed, prior work has shown that interpolated testing can
serve to boost integration of information presented within and across
video lecture segments on a final cumulative test (Jing et al., 2016).
Nonetheless, no such analyses have been conducted in the context of
TPNL experiments using word list stimuli, and more importantly, in a
manner that assesses response organization during initial and new
learning. Critically, a list-by-list analysis of output order based on se-
mantic clustering is necessary to understand how interpolated testing
affects participants’ approach to retrieval, and perhaps encoding, of the
lists. To address this gap in the literature, we asked participants in all
four experiments to learn lists comprising words that belonged to sev-
eral categories, and we analyzed the extent to which free recall of each
list was characterized by category-based clustering. Recent work has
shown that testing can serve to boost category clustering of word sti-
muli (Zaromb & Roediger, 2010). Accordingly, we predicted that in-
terpolated testing should result in higher levels of category clustering
during new learning as compared to no-testing and restudying.
To summarize, in the present experiments, we sought to examine
whether time delay alters the beneficial effects of testing on new
learning. Specifically, we compared testing with no-testing in
Experiment 1, and we compared testing with restudying in Experiment
2. In each of these experiments, we examined the magnitude of the
TPNL effect following a 1-min or 25-min retention interval. In
Experiments 3 and 4, we examined the effects of lag on TPNL. Here,
new learning occurred following either a 1-min lag (in Experiment 4) or
a 25-min lag (in Experiments 3 and 4), and we compared testing to
restudying and no-testing.
Experiment 1
Method
Design and participants
Intervening task (testing vs. no-testing) and retention interval
(1 min vs. 25 min) were manipulated between-subjects. Participants
were 186 undergraduate students from Iowa State University, who
completed the experiment for course credit. English was not the pri-
mary language for 18 participants and their data were removed from
analysis. Moreover, data from an additional 22 participants were re-
moved because the experimenter ran the incorrect experiment program
for the study phase and the delayed test. Therefore, data from 146
participants were analyzed. There were 36 participants in the no-testing,
1-min retention interval condition, 39 participants in the testing, 1-min
retention interval condition, 37 participants in the no-testing, 25-min re-
tention interval condition, and 34 participants in the testing, 25-min re-
tention interval condition. We determined the desired sample size based
on a meta-analytic effect size of TPNL (g = 0.75, Chan et al., in
preparation). To achieve 85% power, each between-subjects condition
required 34 participants.
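The sample-size target can be approximated with a short calculation. The sketch below is illustrative only and is not the authors' procedure: it assumes a two-tailed comparison of two independent groups at α = .05 and uses the normal approximation, and the function name `n_per_group` is hypothetical. Under these assumptions, g = 0.75 with 85% power gives about 32 participants per condition; an exact calculation based on the noncentral t distribution would be expected to give a slightly larger figure, close to the 34 reported above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power=0.85, alpha=0.05):
    """Approximate per-group n for a two-tailed, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    The two-tailed alpha of .05 is an assumption, not taken from the source.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(n_per_group(0.75))  # approximate participants needed per condition
```

Because the approximation ignores the extra uncertainty from estimating the variance, it tends to understate the required sample by one or two participants per group.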
Materials and procedure
Four interrelated lists with 15 words each were constructed.
Each list contained three exemplars from five categories (Van
Overschelde, Rawson, & Dunlosky, 2004). The five categories were
animals, weather, fruits, human body parts, and building parts.
Although the average taxonomic frequencies differed across the five
categories (M_animals = .17, M_weather = .15, M_fruits = .25, M_bodyparts = .33,
M_building = .22), F(4, 55) = 4.63, p = .003, they did not differ across the
four lists (range = .21–.24), F(3, 56) = 0.26, p = .86, B01 = 8.58.
Fig. 1 illustrates the experimental design. Participants were in-
formed that they would see several word lists of 15 words, with each
word presented twice within a list.¹ They were also told that they would
complete some math problems after studying each list, and then they
would either take a memory test for the list or not, with the occurrence
of the test being determined randomly by the computer. In actuality,
participants were either tested after every list (testing condition) or
only after List 4 (no-testing condition). In all memory tests, participants
were told to recall words from only the most recent list, but all parti-
cipants were told to expect a cumulative final test for all studied words.
For each list, a prompt (e.g., “This is Word List 1”) appeared for 2 s,
followed by a fixation cross that appeared in the middle of the screen
for 1 s. Next, the words were presented for 4 s each, with the pre-
sentation of each word separated by a 500 ms blank interval. Each list
was presented twice with no breaks in between, but a different random
order was used during each presentation. List order was counter-
balanced across participants. After studying each list, participants
completed 60 s of math problems. Next, participants either completed
an additional 60 s of math problems (no-testing condition) or they were
given a free recall test for 60 s (testing condition).
In the 1-min retention interval conditions, the List 4 test began after
1 We opted to present each word list twice because pilot testing (N = 14) revealed
near-floor recall performance following a 25-min retention interval when the words were
presented only once.
participants completed 1 min of math problems, and this applied to
participants in both the no-testing and the testing conditions. The List 4
test was administered in the same fashion as the interspersed tests. That
is, participants were instructed to recall as many words as possible from
List 4 in 60 s.
In the 25-min retention interval conditions, participants completed
the List 4 test following 25 min of brain teasers and the videogame
Tetris. The brain teasers were displayed on the computer screen using a
PowerPoint presentation and participants wrote their answers on paper.
The brain teaser task contained 12 questions designed to assess abstract
thinking and problem-solving skills (see the Appendix for examples). If
participants finished the brain teasers within 25 min, they played the
videogame Tetris for the remaining time. Tetris requires the arrange-
ment of cascading blocks into complete lines, which are then cleared
from the grid. Successful performance in Tetris likely requires effective
coordination between spatial imagery (e.g., mental rotation) and motor
skills.
Participants completed a source recognition test as the final task of
the experiment. On each trial, participants saw a studied word and
indicated its list membership (List 1, 2, 3, or 4) by pressing the corre-
sponding number key; they then rated their confidence on a scale from
1 (very unsure) to 8 (very sure). Because performance on this source
recognition test was necessarily contaminated by that of the recall tests,
we opted to present data from the source test in the supplementary
material and those data will not be discussed further.
Results and discussion
For all experiments, we first report results regarding the impact of
interspersed testing and retention interval on correct recall, semantic
clustering, and intrusions during the List 4 test. We then report recall
performance across lists for participants in the testing condition (who
were the only participants tested for the first three lists). Bayes factors
(B01, which indicates support for the null hypothesis over the
alternative hypothesis) are reported whenever a result did not meet the
conventional level of statistical significance (i.e., α = .05).
List 4 recall
Correct recall. We conducted a 2 (intervening task: testing vs. no-
testing) × 2 (retention interval: 1 min vs. 25 min) between-subjects
ANOVA to examine the effects of interspersed testing and retention
interval on new learning (see the left side of Fig. 2). The dependent
variable in this ANOVA was the proportion of List 4 words correctly
recalled. The ANOVA revealed a main effect of intervening task, F(1,
142) = 31.55, p < .01, ηp² = .15. That is, participants who were tested
on Lists 1–3 exhibited greater recall of List 4 (M = .56) than
participants who were not tested (M = .33). The main effect of
retention interval was also significant, F(1, 142) = 36.13, p < .01,
ηp² = .20, with participants recalling fewer List 4 words after the
25-min retention interval (M = .32) than the 1-min retention interval
(M = .56). Perhaps most important for present purposes, the
interaction between intervening task and retention interval was not
significant, F(1, 142) = 1.09, p = .30, ηp² < .01, B01 = 4.81, with the
Bayes factor indicating that the data were nearly five times more
probable under the null hypothesis than under the alternative
hypothesis. This finding suggests that the beneficial effects of testing
on new learning were observed at both the 1-min and 25-min retention
intervals.
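For readers who wish to relate the reported effect sizes to the test statistics, partial eta squared in a fixed-effects ANOVA can be recovered from F and its degrees of freedom. The sketch below is illustrative and is not taken from the authors' analysis; the function name `partial_eta_squared` is hypothetical. It applies the identity ηp² = (F × df_effect) / (F × df_effect + df_error) to the retention-interval effect reported above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its dfs.

    Follows from eta_p^2 = SS_effect / (SS_effect + SS_error) together with
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Retention-interval main effect on List 4 recall: F(1, 142) = 36.13
print(f"{partial_eta_squared(36.13, 1, 142):.2f}")  # prints 0.20
```

Values recomputed this way can differ from published values in the second decimal place, because the published F statistics are themselves rounded.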
Clustering in recall. To investigate how interspersed testing influenced
participants’ use of strategies, we examined the likelihood with which
participants clustered related items together during recall. As stated in
the Method section, we spread words that belong to the same category
across four lists and randomized the presentation order within a list.
Consequently, words from the same category were often not presented
on consecutive encoding trials. Previous research has shown that
testing can improve semantic organization of studied material (Jing
et al., 2016; Zaromb & Roediger, 2010). If this were the case in the
Fig. 1. Experimental design for the four experiments. Experiment 1
compared testing with no-testing across the 1-min and 25-min retention
intervals. Experiment 2 compared testing with restudying across the
1-min and 25-min retention intervals. Experiment 3 compared testing with
no-testing and restudying at a 25-min lag. Experiment 4 compared testing
with restudying and no-testing across the 1-min and 25-min lags. S refers
to study (or restudy); M refers to math problems, a phase that lasted
1 min (hence the 1-min retention interval at the top of the figure); T
refers to an interpolated free recall test; and L1–L4 in subscripts
refer to Lists 1–4, respectively.
present context, testing should increase the clustering of related items
during recall. Adjusted-ratio-of-clustering (ARC, Roenker, Thompson, &
Brown, 1971) quantifies the likelihood that related items follow each
other during output (i.e., clustering in recall), with positive ARC scores
indicating above chance clustering, 0 indicating chance level clustering,
and negative scores indicating below chance clustering. In this analysis,
we assigned a score of 0 whenever ARC was undefined, which occurs when only
one item is recalled from each category or when all of the recalled items
are from the same category.
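The ARC computation described above can be sketched as follows. This is a minimal illustration of the standard adjusted-ratio-of-clustering formula, ARC = (R − E(R)) / (maxR − E(R)), and not the authors' analysis code; the function name `arc_score` and its input format (the category label of each recalled word, in output order) are assumptions. R is the number of adjacent same-category pairs in the output, E(R) = Σ nᵢ² / N − 1 is the number expected by chance, and maxR = N − k, where N is the number of items recalled, nᵢ the number recalled from category i, and k the number of categories represented.

```python
from collections import Counter


def arc_score(recall_categories, undefined_value=0.0):
    """Adjusted ratio of clustering for one recall protocol.

    recall_categories: category label of each recalled word, in output order.
    Returns undefined_value when the ratio is 0/0 (one item per category,
    all items from one category, or nothing recalled), mirroring the
    substitution of 0 described in the text.
    """
    n_total = len(recall_categories)
    if n_total == 0:
        return undefined_value
    counts = Counter(recall_categories)
    # Observed repetitions: adjacent recalled items sharing a category.
    observed = sum(
        a == b for a, b in zip(recall_categories, recall_categories[1:])
    )
    expected = sum(n * n for n in counts.values()) / n_total - 1
    maximum = n_total - len(counts)
    if abs(maximum - expected) < 1e-12:
        return undefined_value
    return (observed - expected) / (maximum - expected)


print(arc_score(["fruit", "fruit", "animal", "animal"]))  # prints 1.0
print(arc_score(["fruit", "animal", "fruit", "animal"]))  # prints -1.0
print(arc_score(["fruit", "animal", "weather"]))          # prints 0.0
```

The first call shows perfect clustering, the second shows output that alternates between categories, and the third shows the undefined case that is replaced with 0.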
A 2 (testing vs. no-testing) × 2 (1 min vs. 25 min) ANOVA showed a
significant main effect for intervening task on List 4 ARC scores, F(1,
142) = 19.19, p < .01, ηp² = .11. The rightmost column in Table 1
depicts results of this analysis. Specifically, the tested participants
clustered their output to a much greater degree (M = .55) than the
nontested participants (M = .20). Retention interval also had an effect,
F(1, 142) = 4.54, p = .04, ηp² = .03, with participants clustering less at
the 25-min retention interval (M = .29) than at the 1-min retention
interval (M = .46). The interaction, however, was not significant, F(1,
142) = 2.22, p = .14, ηp² = .02, B01 = 1.66. In sum, similar to the
correct recall data, the clustering data showed that the benefits of in-
terpolated testing on new learning persisted across the retention in-
terval.
[Fig. 2 plot area. Left panel y-axis: List 4 (Proportion Recalled), .00 to .90. Right panel y-axis: List 4 (Number of Intrusions), 0 to 6. X-axis in both panels: 1-min vs. 25-min Retention Interval. Legend: No-testing, Testing.]
Fig. 2. Correct List 4 recall and intrusions as a function of intervening task and retention interval in Experiment 1. Left panel shows proportion of correct recall; right
panel shows number of intrusions during List 4 recall. Error bars indicate descriptive 95% confidence intervals.
Table 1
ARC (clustering) scores for Experiments 1–4.

Condition         List 1        List 2        List 3        List 4
Experiment 1
  1-min RI
    No-testing                                              0.34 (0.50)
    Testing       0.32 (0.43)   0.59 (0.37)   0.57 (0.36)   0.58 (0.48)
  25-min RI
    No-testing                                              0.05 (0.52)
    Testing       0.48 (0.33)   0.41 (0.46)   0.65 (0.40)   0.52 (0.43)
Experiment 2
  1-min RI
    Restudying                                              0.20 (0.54)
    Testing       0.29 (0.53)   0.56 (0.54)   0.58 (0.49)   0.57 (0.72)
  25-min RI
    Restudying                                              0.19 (0.46)
    Testing       0.30 (0.35)   0.53 (0.29)   0.57 (0.43)   0.49 (0.79)
Experiment 3
  25-min Lag
    No-testing                                              0.36 (0.58)
    Restudying                                              0.31 (0.52)
    Testing       0.35 (0.48)   0.61 (0.45)   0.62 (0.43)   0.58 (0.50)
Experiment 4
  1-min Lag
    No-testing                                              0.27 (0.59)
    Restudying                                              0.18 (0.76)
    Testing       0.29 (0.51)   0.58 (0.45)   0.55 (0.46)   0.71 (0.38)
  25-min Lag
    No-testing                                              0.18 (0.58)
    Restudying                                              0.22 (0.79)
    Testing       0.37 (0.48)   0.52 (0.45)   0.66 (0.40)   0.59 (0.44)

Note: Standard deviations are in parentheses. Note that both the retention interval (RI) and lag manipulations did not occur until List 4.
Intrusions. In our experiments, participants were always told to recall
words from the just-studied list. Therefore, when they recalled words
from other lists, these items were considered intrusions. To examine the
frequency with which intrusions occurred during the List 4 test, we
conducted a 2 (testing vs. no-testing) × 2 (1 min vs. 25 min) ANOVA
with the number of intrusions as the dependent variable. The means for
this analysis are depicted on the right side of Fig. 2. The main effect of
testing was significant, F(1, 142) = 15.33, p < .01, ηp² = .10, such
that intrusions occurred less frequently in the testing condition
(M = 1.09) than in the no-testing condition (M = 2.71). Retention
interval also had a main effect, F(1, 142) = 22.71, p < .01,
ηp² = .14. Specifically, intrusions were about three times more likely
to occur (M = 2.89) at the 25-min interval than at the 1-min interval
(M = 0.92). Lastly, testing and retention interval did not interact, F(1,
142) = 1.49, p = .22, ηp² = .01, B01 = 2.18. Most important for
present purposes, it is clear from Fig. 2 that a TPNL effect on
intrusions persisted across the 25-min retention interval.
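The two dependent measures for the List 4 test (proportion of correct recall and number of intrusions) can be illustrated with a small scoring routine. This is a sketch under stated assumptions rather than the authors' scoring procedure: the function name `score_list4_recall`, the case-insensitive matching, the decision to count each repeated response only once, and the decision to ignore words that appeared on no list are all assumptions made for illustration.

```python
def score_list4_recall(recalled, target_list, prior_lists):
    """Score one recall protocol for the most recent list.

    Returns (proportion_correct, n_intrusions), where an intrusion is a
    recalled word that belongs to one of the earlier lists.
    """
    target = {w.lower() for w in target_list}
    prior = {w.lower() for words in prior_lists for w in words}
    seen = set()
    correct = 0
    intrusions = 0
    for response in recalled:
        word = response.strip().lower()
        if word in seen:      # assumption: repeated responses count once
            continue
        seen.add(word)
        if word in target:
            correct += 1
        elif word in prior:   # word came from an earlier list
            intrusions += 1
        # assumption: words from no studied list are ignored
    return correct / len(target), intrusions


# Hypothetical three-word lists, used only to show the call pattern.
proportion, n_intrusions = score_list4_recall(
    recalled=["apple", "pear", "apple", "roof", "zebra"],
    target_list=["apple", "knee", "roof"],
    prior_lists=[["pear", "elbow"], ["rain"]],
)
print(proportion, n_intrusions)
```

With the actual materials, `target_list` would hold the 15 List 4 words and `prior_lists` the words from Lists 1–3.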
Recall across lists
We now examine recall performance across the four lists for parti-
cipants in the testing condition (who were the only participants tested
on all four lists) using a 4 (Lists 1–4) × 2 (1 min vs. 25 min) mixed
ANOVA. As expected, recall performance across lists differed depending
on whether participants were in the 1-min or 25-min retention interval
condition, and this impression was supported by the significant
interaction between list and retention interval, F(3, 213) = 17.82, p < .01,
ηp² = .20. To further scrutinize this interaction, we conducted separate
repeated measures ANOVAs for participants in the two retention
interval conditions. For participants in the 1-min interval condition, recall
performance remained stable across all four lists (M_L1 = .70, M_L2 = .72,
M_L3 = .72, M_L4 = .70), F(3, 114) = 0.39, p = .76, ηp² = .01,
B01 = 19.34. This finding is consistent with the idea that interspersed
testing inoculates against the buildup of proactive interference. In
contrast, participants in the 25-min interval condition recalled fewer
items during the test for List 4 (M = .41) than for Lists 1–3 (M_L1 = .68,
M_L2 = .69, M_L3 = .73), F(3, 99) = 31.99, p < .01, ηp² = .49. This was
to be expected, as the test for List 4 was delayed by 25 min.
We now examine the ARC clustering scores for participants in the
testing condition. For this analysis, we collapsed the data across the two
retention intervals, given that i) the procedure was identical for Lists
1–3, and ii) our previous results indicated that retention interval did not
affect the clustering of List 4 items for the tested participants. A
repeated measures ANOVA showed that ARC scores rose across lists, F(3,
216) = 4.69, p < .01, ηp² = .06, with the ARC scores rising from 0.40
in List 1 to 0.50 in List 2, 0.60 in List 3, and 0.55 in List 4. It appears
that clustering reached asymptote by List 3 (see Table 1 for means
separated by intervening tasks). An important question here is whether
clustering increased because participants became increasingly aware of
the categorical nature of the words as they studied the lists or because
participants were tested across lists. The former hypothesis suggests that
semantic organization was built across lists based on continued exposure
to related words. In contrast, the latter hypothesis suggests that ex-
posure alone was insufficient; rather, participants built organization
across lists through retrieval. If the exposure hypothesis is correct, then
the List 4 ARC score should not differ between participants in the
testing and no-testing conditions (because both groups had been ex-
posed to the same number of related words across lists). This is clearly
not the case. In fact, the List 4 ARC score for the nontested participants
(M = .34, at the 1-min retention interval) was similar to the List 1 ARC
score for the tested participants (M = .40, averaged across participants
in the 1-min and 25-min retention interval, but note that List 1 recall
actually occurred 1 min after encoding for all tested participants), t
(107) = 0.60, p = .55, d = .12, B01 = 3.98. This finding suggests that
continued exposure to related items did not facilitate semantic orga-
nization, but retrieval practice did.
Experiment 2
In Experiment 2, we replaced the no-testing condition with a rest-
udying condition as the control. This change was implemented to ex-
amine whether the benefit of testing on new learning was due, at least
in part, to the re-exposure of the same items studied prior to new
learning. Although the ARC results from Experiment 1 showed that
exposure to categorized words across lists did not increase recall clus-
tering in the absence of retrieval practice, it remains possible that re-
exposure to identical words, rather than continued exposure to related
(but different) words, was responsible for the enhanced clustering (and
enhanced new learning) observed for participants in the testing con-
dition. In the testing effect literature, researchers sometimes compare a
testing condition with a no-testing condition (e.g., Chan, 2010; Chan &
McDermott, 2007), and at other times compare a testing condition with
a restudying condition (e.g., Roediger & Karpicke, 2006). The latter
comparison has the advantage of eliminating differences in time-on-
task or item re-exposure between the testing and control conditions. In
a similar way, including a condition in which participants restudied the
words from Lists 1–3 allowed us to examine whether the TPNL effects
observed earlier were driven by re-exposure or retrieval (see also
Szpunar et al., 2008; Experiment 3). Specifically, testing might have
potentiated new learning of List 4 because recalling, and therefore re-
encoding, the studied words enhanced one’s semantic organization of
the materials. This enhanced semantic organization, or better recogni-
tion of the categorized structure of the lists, might potentiate encoding
of new words because it facilitated relational processing of the words in
a list in the testing condition (McDaniel & Einstein, 1989; McDaniel,
Einstein, & Waddill, 1990). If re-exposure to the categorized words
through restudying (or retrieval practice) potentiates new learning of
List 4 through enhanced semantic organization, then testing should not
potentiate new learning relative to restudying, as manifested by both
recall performance and clustering scores.
Method
Participants, design, materials, and procedure
Participants were 104 undergraduate students from Iowa State
University. Of these, one was eliminated from analysis due to an ex-
perimenter error, one was eliminated because English was not his/her
primary language, and two were eliminated because they failed to
follow instructions. Therefore, 100 participants were included in the
final analyses, with 33 in the restudying, 1-min interval condition, 16 in
the testing, 1-min interval condition, 34 in the restudying, 25-min interval
condition, and 17 in the testing, 25-min interval condition. There were
fewer participants in the testing conditions than in the restudying
conditions because the former were direct replications of the same
conditions from Experiment 1. As will be clear from the Results to
follow, the data from the present testing condition closely mirrored
those from Experiment 1.
Experiment 2 used a 2 (Intervening task: testing vs. restudying) × 2
(Retention interval: 1 min vs. 25 min) between-subjects design. The
materials and procedure were identical to those in Experiment 1 with
the following exceptions. During the presentation of Lists 1–3, partici-
pants in the restudying condition first studied each word in a list twice
(i.e., identical to Experiment 1). Next, they completed math problems
for 60 s, and then they restudied the words in the same list twice again,
with a fresh random order for each presentation of a given list.
Therefore, for Lists 1–3, participants in the restudying condition
encoded each item four times, whereas participants in the testing
condition encoded each item twice.² Most importantly, however, List 4 was
presented in the same manner for all participants, such that each word
was studied only twice before participants were tested.
Results and discussion
List 4 recall
Correct recall. A 2 (testing vs. restudying) × 2 (1 min vs. 25 min)
between-subjects ANOVA was conducted to examine the effects of
testing and retention interval on the proportion of correct recall in List
4 (see left side of Fig. 3). A main effect of intervening task was found,
such that participants recalled more List 4 items when they were tested
on Lists 1–3 (M = .57) than when they restudied Lists 1–3 (M = .25),
F(1, 96) = 37.11, p < .01, ηp² = .28. A main effect of retention interval
was also found, F(1, 96) = 15.16, p < .01, ηp² = .14, which indicated
that participants recalled more words after a 1-min retention interval
(M = .52) than after a 25-min retention interval (M = .31). Unlike
Experiment 1, however, the interaction between intervening task and
retention interval was significant, F(1, 96) = 5.22, p = .03, ηp² = .05.
At the 1-min retention interval, participants in the testing condition
recalled far more words from List 4 (M = .75) than participants in the
restudying condition (M = .29), t(47) = 5.88, p < .01, d = 1.79. At
the 25-min retention interval, participants in the testing condition still
recalled more words from List 4 (M = .41) than participants in the
restudying condition (M = .21), t(49) = 2.71, p < .01, d = .81, but
the benefit of testing on new learning here was weaker than that
observed at the 1-min retention interval.
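The standardized effect sizes for these pairwise comparisons can be related to the t statistics and group sizes. The sketch below is illustrative and is not the authors' computation (they may have computed d directly from the means and pooled standard deviation); the function name `cohens_d_from_t` is hypothetical. It uses the conversion d = t × √(1/n₁ + 1/n₂) for two independent groups, with the group sizes given in the Method section (33 restudying and 16 testing participants at the 1-min retention interval).

```python
from math import sqrt


def cohens_d_from_t(t_value, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d.

    Assumes a pooled-variance t test on two independent groups.
    """
    return t_value * sqrt(1 / n1 + 1 / n2)


# Testing vs. restudying at the 1-min retention interval: t(47) = 5.88
print(f"{cohens_d_from_t(5.88, 33, 16):.2f}")  # prints 1.79
```

Because the published t values are rounded, effect sizes recovered this way may differ from the reported ones in the second decimal place.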
Despite this finding, we believe that it might be premature to con-
clude that the TPNL effect had weakened across the retention interval
for two reasons. First, the data of Experiment 1 indicated that the TPNL
effect, relative to no-testing, persisted over the delay. It is difficult to
envision why the effect would decline over time when the comparison
condition was restudying instead of no-testing, given that restudying
information does not typically slow forgetting (Carpenter, Pashler,
Wixted, & Vul, 2008). Second, and most importantly, participants in the
restudying condition exhibited very low recall performance at the 1-
min retention interval. This created a situation whereby considerably
less forgetting was possible in the restudying condition than in the
testing condition. In other words, the significant interaction between
interpolated task and retention interval might have been an artifact of
the poor initial recall performance in the restudying condition.
We also note here that participants in the restudying condition re-
called considerably fewer List 4 items in the 1-min retention interval
condition (M = .29) than participants in the no-testing condition in
Experiment 1 (M = .43), t(67) = 2.06, p = .02, d = .50. Although this
restudy deficit may seem odd at first glance, it is not unusual. In fact,
prior research on TPNL has reported similar patterns (e.g., Szpunar
et al., 2008). We attribute this restudy deficit in new learning (relative
to no-testing) to the continuous buildup of proactive interference
during the encoding of Lists 1–3. Specifically, in the restudying condi-
tion, participants encoded Lists 1–3 twice as often as participants in the
no-testing condition. Therefore, by the time List 4 was presented for
encoding, participants in this condition had already encoded a total of
six lists (although only three unique lists), whereas participants in the
no-testing condition had only encoded a total of three lists. We believe
that repeated studies of the first three lists might have impaired recall of
List 4 relative to no-testing due to an increase in response competition.
Clustering in recall. Clustering in List 4 recall (ARC score) was examined
using a 2 (testing vs. restudying) × 2 (1 min vs. 25 min) ANOVA (see
the rightmost column of Table 1). The main effect of intervening task
was significant, F(1, 96) = 7.10, p = .01, ηp² = .07, such that testing
led to greater clustering (M = .53) than restudying (M = .20).
Consistent with Experiment 1, retention interval had little impact on
clustering, F(1, 96) = 0.13, p = .72, ηp² = .001, B01 = 4.61, with
participants producing similar levels of clustering at the 1-min
(M = .39) and 25-min retention intervals (M = .34). In addition, the
interaction between intervening task and retention interval was not
significant, F(1, 96) = 0.09, p = .77, ηp² = .001, B01 = 3.23. Once
again, these data suggest that the benefits of testing (relative to
restudying) persisted across the 25-min delay.
[Figure 3. Two bar-chart panels comparing the Restudying and Testing conditions at the 1-min and 25-min retention intervals. Left panel y-axis: List 4 (Proportion Recalled); right panel y-axis: List 4 (Number of Intrusions).]
Fig. 3. Correct List 4 recall and intrusions as a function of intervening task and retention interval in Experiment 2. Left panel shows proportion of correct recall; right
panel shows number of intrusions during List 4 recall. Error bars indicate descriptive 95% confidence intervals.
2 Because we wanted the restudy opportunity to mirror that of the original study
opportunity, we presented the study list twice during the restudy trial. This procedure,
however, also increased the time-on-task for participants in the restudying condition
relative to the testing condition. Specifically, for Lists 1–3, participants in the restudying
condition encoded the list words for 2 min, which was followed by 1 min of math, and
then they restudied the list words for another 2 min. In contrast, participants in the testing
condition studied the list words for 2 min, then they did 1 min of math, and then spent
1 min recalling words from that list. To ensure that our results could not be attributed to
this procedural difference, we also collected data for the restudying condition under
which the restudy trial presented each list word only once. Participants in this restudying
condition thus spent the same amount of time on task as their tested counterparts. The
data (N = 22 for the 1-min condition and N = 21 for the 25-min condition) were highly
similar to those reported in the present paper (i.e., the restudy twice participants).
Specifically, proportion of List 4 recall was .34 in the 1-min condition and .15 in the
25-min condition, and number of intrusions was 1.41 in the 1-min condition and 5.00 in the
25-min condition. Most importantly, consistent with the results reported in the main text,
interpolated testing enhanced new learning relative to restudying in this new sample in
both the 1-min condition, t_correct(36) = 4.86, p_correct < .01, t_intrusion(36) = 2.02,
p_intrusion = .03, and the 25-min condition, t_correct(36) = 3.31, p_correct < .01,
t_intrusion(36) = 3.21, p_intrusion < .01.
J.C.K. Chan et al. Journal of Memory and Language 102 (2018) 83–96
Intrusions. A 2 × 2 ANOVA was conducted to examine the effects of our
independent variables on intrusions (see right side of Fig. 3). Here, a
main effect was observed for intervening task, F(1, 96) = 16.02,
p < .01, ηp² = .14, such that participants who were tested on Lists
1–3 produced fewer intrusions during List 4 recall (M = .78) than those
who restudied Lists 1–3 (M = 3.59). In addition, a significant main
effect was found for retention interval, F(1, 96) = 6.19, p = .02,
ηp² = .06, with participants producing more intrusions at the 25-min
retention interval (M = 3.06) than at the 1-min retention interval
(M = 1.31). Similar to the results of Experiment 1, the interaction
between testing and retention interval was not significant,
F(1, 96) = 1.82, p = .18, ηp² = .02, B01 = 1.42. Once again, as can be
seen clearly in Fig. 3, the intrusion data showed that the benefits of
retrieval on new learning remained through the delay.
Recall across lists
Recall performance across lists for participants in the testing
condition was analyzed using a 4 (List 1–4) × 2 (1 min vs. 25 min) mixed
ANOVA. Similar to the results in Experiment 1, list and retention
interval interacted, F(3, 93) = 8.59, p < .01, ηp² = .22. Whereas
participants in the 1-min condition performed similarly across all lists
(ML1 = .72, ML2 = .76, ML3 = .73, ML4 = .75), F(3, 45) = 0.44,
p = .73, ηp² = .03, B01 = 7.71, those in the 25-min condition recalled,
as expected, substantially fewer words from List 4 than from the
remaining lists (ML1 = .69, ML2 = .68, ML3 = .73, ML4 = .41),
F(3, 48) = 11.83, p < .01, ηp² = .43.
Clustering (ARC scores) across lists for the tested participants was
analyzed using a repeated measures ANOVA (see Table 1). We again
collapsed the data across retention interval for the same reasons as
described in Experiment 1. There was a marginally significant effect of
list on the ARC scores, F(3, 96) = 2.63, p = .05, η² = .08, with the ARC
score for List 1 being lower than those for Lists 2–4 (ML1 = .29, ML2 = .54,
ML3 = .57, ML4 = .53). Similar to the results of Experiment 1, these
data showed that clustering peaked by List 3, with the greatest gain
observed between Lists 1 and 2.
To examine whether re-exposure to categorized words was able to
increase clustering in the absence of retrieval practice, we compared
the List 4 ARC scores for participants in the restudying condition with
the List 1 ARC scores for participants in the testing condition. The List 4
ARC score for participants in the restudying condition (M = .20 after 1-
min of math) did not differ from the List 1 ARC scores for participants in
the testing condition (M = .29), t(64) = 0.78, p = .44, d = .12,
B01 = 3.05. This finding is consistent with that from Experiment 1, and
it suggests that retrieval practice, rather than repeated exposures to
related words, increased semantic organization during subsequent re-
trieval.
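For readers unfamiliar with the clustering measure, the sketch below computes an adjusted ratio of clustering (ARC) score for a single recall protocol. It assumes the standard definition, ARC = (R − E(R)) / (maxR − E(R)), where R is the number of adjacent same-category pairs in the output, E(R) = Σnᵢ²/N − 1 is the number expected by chance, and maxR = N − k for N recalled words from k categories. This section does not restate the formula, so treat that definition, the function name, and the example category labels as our assumptions.

```python
import math
from collections import Counter


def arc_score(recalled_categories):
    """Adjusted ratio of clustering for one recall protocol.

    recalled_categories: the category label of each recalled word, in output order.
    Returns None when the score is undefined (chance equals the maximum).
    """
    n_total = len(recalled_categories)
    if n_total == 0:
        return None
    counts = Counter(recalled_categories)
    # R: adjacent output positions that share a category.
    observed = sum(
        first == second
        for first, second in zip(recalled_categories, recalled_categories[1:])
    )
    # E(R): repetitions expected if output order were random.
    expected = sum(n * n for n in counts.values()) / n_total - 1
    # maxR: repetitions under perfect clustering.
    maximum = n_total - len(counts)
    if math.isclose(maximum, expected):
        return None
    return (observed - expected) / (maximum - expected)


# Hypothetical protocol: six recalled words from three made-up categories.
protocol = ["weather", "weather", "tools", "tools", "weather", "fruit"]
print(arc_score(protocol))
```

A score of 1 indicates perfect category clustering and a score near 0 indicates chance-level clustering, which is how the ARC means in this section should be read.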
Experiment 3
The purpose of Experiment 3 was to determine the effects of de-
laying new learning (rather than delaying the test for new learning) on
the TPNL effect. Specifically, for all participants in this experiment, a
25-min lag occurred between the intervening task of List 3 and the
encoding of List 4, during which participants completed brain teasers
and played Tetris (the same tasks used during the retention interval in
Experiments 1 and 2). We did not include a no-lag condition in this
experiment because such a condition was identical to the 1-min re-
tention interval condition in Experiments 1 and 2. It is important to
note that, unlike interspersed testing, having participants do math
problems between encoding episodes does not potentiate new learning
(Szpunar et al., 2008; Weinstein et al., 2011; Wissman et al., 2011). At
first glance, this result seems to suggest that testing is special, because
other intervening activities (e.g., doing math problems) do not enhance
new learning. However, there are two reasons to be cautious in drawing
this conclusion. First, the lag between original learning and new
learning is very short in these studies (about 1 min, Allen & Arbak,
1976; Arkes & Lyons, 1979; Nunes & Weinstein, 2012; Szpunar, Jing, &
Schacter, 2014; Tulving & Watkins, 1974; Weinstein et al., 2014),
which might be inadequate to serve as a study break. Second and more
importantly, the prevailing wisdom emerging from the context change
literature is that doing math problems does not trigger context change
from encoding (Abel & Bauml, 2016; Klein, Shiffrin, & Criss, 2011;
Sahakyan & Hendricks, 2012). To alleviate these concerns, we chose
lag activities that were very different from the encoding task and
substantially expanded the duration of the lag from 1 min to 25 min,
which was nearly double the duration of the entire encoding task for
Lists 1–3. If prior episodic retrieval is necessary to enhance new
learning, then a TPNL effect should be observed even when List 4 is
encoded after a 25-min lag. In contrast, if interspersed testing enhances
new learning because it serves a function similar to inserting a study
break, then TPNL should not occur after the lag.
Method
Participants, design, materials, and procedure
Participants were 127 undergraduate students from Iowa State
University. Six participants were eliminated due to an experimenter
error, 10 because English was not their primary language, and one
because the participant failed to follow instructions. Therefore, 110
participants were included in the final analyses, with 39 in the no-testing
condition, 33 in the restudying condition, and 38 in the testing condition.
The materials and procedure were identical to those used in
Experiments 1 and 2. As depicted in Fig. 1, the only difference between
Experiment 3 and Experiments 1 and 2 was that the 25-min delay
preceded the encoding of List 4 rather than the test for List 4.³
Results and discussion
List 4 recall
Correct recall. A one-way between-subjects ANOVA showed a main
effect of intervening task (testing, no-testing, restudying) on List 4
correct recall (see the left side of Fig. 4), F(2, 107) = 12.22, p < .01,
η² = .19.⁴ Specifically, participants in the testing condition recalled
more List 4 items (M = .71) than those in the no-testing condition
(M = .43), t(75) = 5.11, p < .01, d = 1.16, and the restudying
condition (M = .50), t(69) = 3.43, p < .01, d = .82. No significant
difference in List 4 recall was observed between the participants in the
no-testing and restudying conditions, t(70) = 1.12, p = .27, d = 0.27,
B01 = 2.40. These results show that interspersed testing enhanced new
learning despite the 25-min lag, during which participants completed a
series of tasks unrelated to episodic encoding. This finding suggests that
a study break alone, even one as long as 25 min, does not potentiate
new learning, at least when the dependent variable is correct recall.
Instead, prior retrieval appears necessary to alter how participants
encode and/or retrieve new information. We describe this idea in detail
in the General Discussion.
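The effect sizes for these pairwise comparisons can be approximately reproduced from the reported t values and the group sizes given in the Method (38 testing, 39 no-testing, 33 restudying). The sketch below assumes independent-samples t tests with a pooled standard deviation, in which case d = t × sqrt(1/n1 + 1/n2). That assumption and the function name are ours and are not stated in the text.

```python
import math


def cohens_d_from_t(t_value: float, n1: int, n2: int) -> float:
    """Cohen's d implied by an independent-samples t statistic (pooled SD)."""
    return t_value * math.sqrt(1 / n1 + 1 / n2)


# Group sizes from the Experiment 3 Method section.
n_testing, n_no_testing, n_restudying = 38, 39, 33

comparisons = {
    "testing vs. no-testing": (5.11, n_testing, n_no_testing, 1.16),
    "testing vs. restudying": (3.43, n_testing, n_restudying, 0.82),
    "no-testing vs. restudying": (1.12, n_no_testing, n_restudying, 0.27),
}

for label, (t_value, n1, n2, d_reported) in comparisons.items():
    d = cohens_d_from_t(t_value, n1, n2)
    print(f"{label}: computed d = {d:.3f}, reported d = {d_reported:.2f}")
```

The first two values match the reported effect sizes after rounding. The third differs slightly in the second decimal, which would be expected if the published t value was itself rounded.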
3 Similar to Experiment 2, participants in the restudying condition re-encoded the
items in Lists 1–3 twice, which increased their time-on-task by 60 s per list relative to
participants in the testing and no-testing conditions. To address this difference in meth-
odology, we tested an additional group of participants (N = 23) who restudied each word
only once. Proportion of correct recall was .53 and number of intrusions was 1.35 for this
group of participants. Similar to the conclusion in the main text, interpolated testing
enhanced List 4 recall relative to restudying, t(59) = 2.76, p < .01. However, unlike the
results in the main text, interpolated testing reduced List 4 intrusions relative to rest-
udying, t(59) = 2.03, p = .02, although the difference was modest.
4 In a single independent variable ANOVA, η² is the same as ηp².
Clustering in recall. A one-way ANOVA showed a marginal effect of
intervening task on semantic clustering during List 4 recall, F(2,
107) = 2.67, p = .07, η² = .05 (see Table 1). Specifically,
interspersed testing led to marginally greater clustering (M = .58)
than no-testing (M = .36), t(75) = 1.79, p = .08, d = 0.40,
B01 = 1.07, and significantly greater clustering than restudying
(M = .31), t(69) = 2.25, p = .03, d = 0.54, B01 = 2.05.
Intrusions. Unlike the results of Experiments 1 and 2, participants
exhibited few intrusions during List 4 recall regardless of the nature
of the intervening task (see the right side of Fig. 4), F(2, 107) = 2.01,
p = .14, η² = .04, B01 = 2.27. Planned comparisons showed that
participants in the testing condition produced significantly fewer
intrusions (M = 0.53) when compared to the no-testing condition
(M = 1.08), t(75) = 2.13, p = .04, d = 0.49, but not when compared
to the restudying condition (M = 0.79), t(69) = 1.05, p = .30,
d = 0.25, B01 = 2.54. Intrusion rates also did not differ between the
no-testing and restudying conditions, t(70) = 0.86, p = .39, d = 0.20,
B01 = 2.98. These findings contrast with those from Experiments 1 and
2, in which we observed substantially more intrusions 1 min after List 4
encoding in the no-testing (M = 1.47) and restudying conditions
(M = 2.24) than in the present experiment. Therefore, although the
lag had little impact on the magnitude of the TPNL effect for correct
recall, it reduced the effect for intrusions.
Recall across lists
Similar to Experiments 1 and 2, recall across lists remained stable
for participants in the testing condition (ML1 = .66, ML2 = .68,
ML3 = .72, ML4 = .71), F(3, 111) = 1.29, p = .28, η² = .03,
B01 = 6.41. Once again, this result shows that the lag did not affect
learning of List 4 for the tested participants.
A repeated measures ANOVA revealed that clustering increased
across lists (ML1 = .35, ML2 = .61, ML3 = .62, ML4 = .58),
F(3, 111) = 4.46, p < .01, η² = .108. Consistent with the data from
Experiments 1 and 2, these ARC scores showed that clustering reached an
asymptotic level by List 3, with the greatest gain observed between Lists
1 and 2. In addition, the List 4 ARC scores for participants in the
no-testing condition (M = .36) and the restudying condition (M = .31)
were comparable to the List 1 ARC score (M = .35) for the participants
in the testing condition, F(2, 107) = 0.09, p = .92, B01 = 10.96. Once
again, this finding indicates that repeated exposure to categorized words
did not foster clustering in subsequent recall, but retrieval practice did.
Experiment 4
The results of Experiment 3 clearly showed that the benefits of re-
trieval on new learning remained robust despite the 25-min lag prior to
new learning. However, because we did not include a short-lag condi-
tion in Experiment 3, conclusions regarding the effects of lag must be
inferred on the basis of cross-experimental comparisons (i.e., against
the 1-min retention interval condition in Experiments 1 and 2).
Therefore, in Experiment 4, we attempted to replicate and extend the
findings of Experiment 3 with the addition of a 1-min lag condition. Our
objective was to compare the effects of a short- vs. a long-lag in a single
experiment. To this end, we conducted an experiment in which we
manipulated both lag (1 min, 25 min) and intervening task (no-testing,
restudying, testing) between-subjects.
Method
Participants, design, materials, and procedure
A total of 238 participants participated in this experiment. The data
from five participants were omitted from the analysis: two due to
English not being their primary language, one due to an experimenter
error, one due to data corruption, and one due to the participant not
following instructions. The final data set therefore included 233 parti-
cipants, with 36 in the no-testing, 1-min lag condition, 42 in the no-
testing, 25-min lag condition, 36 in the restudying, 1-min lag condition,
41 in the restudying, 25-min lag condition, 37 in the testing, 1-min lag
condition, and 41 in the testing, 25-min lag condition.
The procedure for Experiment 4 was identical to that of the previous
experiments, except that half of the participants were in the 1-min lag
conditions (similar to the short retention interval conditions in
Experiments 1 and 2) and the remaining participants were in the 25-min
lag conditions (similar to Experiment 3).
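The cell sizes listed above determine the degrees of freedom for the 3 × 2 between-subjects ANOVA reported in the Results. The sketch below tallies the six cells and derives those degrees of freedom; it is bookkeeping on the reported sample sizes using our own variable names, not part of the original analysis.

```python
# Final cell sizes for Experiment 4, keyed by (intervening task, lag).
cell_sizes = {
    ("no-testing", "1-min"): 36,
    ("no-testing", "25-min"): 42,
    ("restudying", "1-min"): 36,
    ("restudying", "25-min"): 41,
    ("testing", "1-min"): 37,
    ("testing", "25-min"): 41,
}

n_total = sum(cell_sizes.values())
n_tasks = len({task for task, _ in cell_sizes})
n_lags = len({lag for _, lag in cell_sizes})

df_task = n_tasks - 1
df_lag = n_lags - 1
df_interaction = df_task * df_lag
df_error = n_total - len(cell_sizes)  # one mean estimated per cell

print(f"N = {n_total}")
print(f"task: F({df_task}, {df_error}); lag: F({df_lag}, {df_error}); "
      f"interaction: F({df_interaction}, {df_error})")
```

These values agree with the F(2, 227) and F(1, 227) tests reported for correct recall and intrusions in this experiment.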
Results and discussion
List 4 recall
Correct recall. The data for Experiment 4 were consistent with those
from Experiments 1–3, with interpolated testing producing a
substantial benefit on new learning relative to both no-testing and
restudying, and this benefit persisted through the 25-min lag (see
Fig. 5). These impressions were supported by the results of a 3 (testing,
restudying, no-testing) × 2 (1-min lag, 25-min lag) between-subjects
ANOVA. Specifically, there was a main effect of interpolated task,
F(2, 227) = 41.74, p < .01, ηp² = .27, with the tested participants recalling
more List 4 words (M = .70) than both the nontested participants
(Mnontested = .43), t(154) = 7.23, p < .01, d = 1.16, and the restudied
participants (Mrestudied = .36), t(153) = 8.75, p < .01, d = 1.41.
Further, neither the main effect of lag, F(1, 227) = 1.59, p = .201,
ηp² < .01, B01 = 4.29, nor the interaction between interpolated task
and lag was significant, F(2, 227) = 0.93, p = .40, ηp² < .01,
B01 = 5.59. Indeed, an examination of Fig. 5 shows clearly that the
benefits of testing on new learning were robust in both lag conditions.
[Figure 4. Two bar-chart panels comparing the No-testing, Restudying, and Testing conditions after the 25-min lag. Left panel y-axis: List 4 (Proportion Recalled); right panel y-axis: List 4 (Number of Intrusions).]
Fig. 4. Correct List 4 recall and intrusions as a function of intervening task in Experiment 3. Left panel shows proportion of correct recall; right panel shows number
of intrusions during List 4 recall. Error bars indicate descriptive 95% confidence intervals.
Before turning to the data on semantic clustering, we note an in-
teresting finding that is similar to one we reported in Experiment 2.
Specifically, participants in the restudying condition recalled fewer List
4 words (M = .31) than participants in the no-testing condition
(M = .42) – a restudy deficit, t(70) = 2.05, p = .04, d = 0.48. Once
again, we interpret this finding as reflecting the fact that repeatedly
studying the list items allowed more proactive interference to
build up across lists relative to studying each list only once, which
further suppressed learning of List 4.
Clustering in recall. Clustering in List 4 recall (ARC score) was examined
in a 3 (testing vs. restudying vs. no-testing) × 2 (1-min lag vs. 25-min
lag) ANOVA (see Table 1), and the results mirrored those from correct
recall. The main effect of intervening task was significant,
F(2, 212) = 12.99, p < .01, ηp² = .07. Specifically, participants who were
tested on Lists 1–3 were far more likely to cluster their recall during List
4 (M = 0.65) than participants who were not tested (M = 0.21),
t(150) = 5.24, p < .01, d = 0.85, and participants who restudied those
lists (M = 0.22), t(142) = 4.30, p < .01, d = 0.72. Moreover, neither
lag, F(1, 212) = 0.39, p = .54, ηp² < .01, B01 = 5.38, nor its
interaction with intervening task was significant, F(2, 212) = 0.44,
p = .65, ηp² = 0.01, B01 = 11.00.
Intrusions. A 3 × 2 ANOVA showed a main effect of intervening task,
F(2, 227) = 10.60, p < .01, ηp² = .09, a main effect of lag,
F(1, 227) = 3.98, p = .05, ηp² = .02, and an interaction that was
marginal, F(2, 227) = 2.48, p = .09, ηp² = .02. The main effect of
intervening task showed that testing reduced the number of intrusions
(M = 0.27) during List 4 recall relative to restudying (M = 1.66),
t(153) = 4.62, p < .01, d = 0.74, and no-testing (M = 1.34),
t(154) = 3.90, p < .01, d = 0.62. To further examine the effects of
lag on intrusions, we conducted a separate t-test for each intervening task
condition. As can be seen in the right panel of Fig. 5, performing
retrieval practice on Lists 1–3 nearly eliminated intrusions during List 4
recall, regardless of whether a 1-min (M = 0.19) or 25-min lag
(M = 0.34) preceded the encoding of List 4, t(76) = 1.35, p = .18,
d = 0.31, B01 = 1.94. However, increasing the lag from 1 min to 25 min
reduced intrusions for participants in the no-testing condition
(M1-min = 1.97, M25-min = 0.74), t(76) = 2.53, p = .01, d = 0.57, a
conclusion consistent with the one from Experiment 3. In contrast,
although increasing the lag also numerically reduced intrusions for
participants in the restudying condition (M1-min = 1.89, M25-min = 1.44),
the effect was not significant, t(75) = 0.76, p = .45, d = 0.17,
B01 = 3.30. Notably, this latter conclusion differed from that based on
Experiment 3, in which participants in the restudy condition showed
fewer intrusions (M25-min = 0.79) than their counterparts in
Experiment 2 (M1-min = 2.24). We suspect that this discrepancy might simply be the
result of sampling differences. To obtain a more representative result,
we examined the effects of lag on intrusions for the restudy participants
by combining the data from Experiments 2 (1-min lag), 3 (25-min lag),
and 4 (1-min lag and 25-min lag). The outcome of this analysis revealed
a significant, but modest, effect of lag on intrusions, t(141) = 2.21,
p = .03, d = 0.37, with participants producing more intrusions during
List 4 recall following a 1-min lag (M = 2.06) than following a 25-min
lag (M = 1.15).
Recall across lists
Recall performance across lists for participants in the testing con-
dition was analyzed using a 4 (List 1–4) × 2 (1-min lag vs. 25-min lag)
mixed ANOVA. Similar to Experiment 3, recall probabilities remained
stable across lists, F(3, 228) = 0.23, p = .87, ηp² < .01, B01 = 51.76.
Moreover, lag had no effect on recall overall, F(1, 76) = 0.28, p = .60,
ηp² < .01, B01 = 3.35, nor did it interact with list, F(3, 228) = 0.19,
p = .90, ηp² < .01, B01 = 24.08. In the 1-min lag condition, proportions
recalled from Lists 1–4 were .68, .67, .68, and .70, respectively, and
in the 25-min lag condition, they were .69, .70, .71, and .70.
We analyzed the ARC scores across lists for the tested participants
using a repeated measures ANOVA. Similar to the results from
Experiments 1–3, the data showed that ARC scores rose across lists,
with a majority of the increase occurring between Lists 1 and 2
(ML1 = 0.32, ML2 = 0.55, ML3 = 0.63, ML4 = 0.66), F(3, 228) = 12.15,
p < .01, ηp² = .14. Moreover, the List 4 ARC scores for participants in
the no-testing condition (Mno-testing = 0.22) and the restudying condi-
tion (Mrestudying = 0.21) did not differ from the List 1 ARC scores for
participants in the testing condition, ts < 1.32, ps > .19, ds < 0.19,
B01s > 2.58. This finding, once again, suggests that neither continued
exposures to the lists (i.e., by studying three inter-related lists in the no-
testing condition) nor repeated exposures to the lists (i.e., by restudying
each list) improved clustering during the recall of List 4.
[Figure 5. Two bar-chart panels comparing the No-testing, Restudy, and Testing conditions at the 1-min and 25-min lags. Left panel y-axis: List 4 (Proportion Recalled); right panel y-axis: List 4 (Number of Intrusions).]
Fig. 5. Correct List 4 recall and intrusions as a function of lag and intervening task in Experiment 4. Left panel shows proportion of correct recall; right panel shows
number of intrusions during List 4 recall. Error bars indicate descriptive 95% confidence intervals.
General discussion
In four experiments, we found that interspersing retrieval practice
between encoding episodes enhanced new learning relative to both a
no-testing and a restudying baseline. The critical findings can be sum-
marized as follows. First, as indicated by List 4 correct recall, testing
enhanced new learning relative to no-testing and restudying, and this
effect occurred regardless of the length of retention interval and lag.
Second, based on the intrusion data, testing potentiated new learning
relative to no-testing and restudying at both the 1-min and 25-min re-
tention intervals. However, increasing the lag between original learning
and new learning also substantially reduced intrusions for both the no-
testing and restudying participants, thus reducing the advantage of
testing in this regard. Third, as indicated by the ARC clustering scores,
testing enhanced semantic organization during List 4 recall relative to
both no-testing and restudying. Moreover, this benefit of prior retrieval
on clustering scores persisted across both the 25-min retention interval
and lag. We now discuss the theoretical and practical implications of
these findings.
The persistence of test-potentiated new learning
As we have described in the Introduction, the true influence of re-
tention interval (i.e., in the absence of contamination from a prior test)
on the TPNL effect was previously unknown. Prior attempts at ex-
amining the persistence of the TPNL effect have typically used a re-
peated testing procedure, in which the delayed test for new learning
(similar to the List 4 test in the present experiments) was repeated
across both the shorter and longer retention intervals. Amongst these
studies, some have observed nearly equivalent magnitudes of TPNL at
both a 1-min and 30-min retention interval (Szpunar et al., 2008),
whereas others have found the effect to diminish considerably from an
immediate test to a 15-min delayed test (Wissman & Rawson, 2015).
Because these studies administered the memory test for new learning
over multiple occasions, it is difficult to ascribe differences in perfor-
mance, or lack thereof, between the earlier and later tests to retention
interval alone. Specifically, any reduction in the TPNL effect on the
second test could be due to a beneficial effect of the first criterial test in
the control conditions. Additionally, testing the new learning materials
repeatedly might alter the retrieval processes that one invokes during
recall. For example, Pierce et al. (2017) argued that prior testing of the
original learning materials renders the new learning materials
distinctive, because they are the only items that have not yet been
tested. This distinctiveness in turn facilitates post-retrieval
monitoring, which allows participants to reduce intrusions in
recall. To test this idea, Pierce et al. had their participants take a test for
the new learning materials twice, with the second test occurring just
two minutes after the first. Their logic was that any distinctiveness
advantage enjoyed by the new-learning items would be removed by the
first test, after which all studied items would have been tested once.
Consistent with this idea, the TPNL-associated reduction in intrusions
was markedly weakened in the second test. In short, testing the critical
new-learning items across multiple occasions does not provide the ideal
paradigm to examine the influence of retention interval on test-po-
tentiated new learning.
In Experiments 1 and 2, we examined the persistence of the TPNL
effect across a shorter (1 min) and a longer (25 min) retention interval
without the contamination of repeated testing, and our results showed
that retrieval potentiated new learning at both retention intervals.
Notably, the benefits of testing on new learning were observed re-
gardless of whether the baseline condition was no-testing (Experiment
1) or restudying (Experiment 2), and whether the dependent measure
was accurate recall, intrusions, or semantic clustering. Across these
analyses, only one (out of six) showed that the TPNL effect was sig-
nificantly weaker at the 25-min retention interval than at the 1-min
retention interval (i.e., List 4 correct recall in Experiment 2). We
caution against over-interpreting this result because, as described ear-
lier, this finding was likely driven by the very poor recall performance
at the 1-min retention interval for the restudy participants. Moreover,
when the data from the three dependent variables (i.e., correct recall,
intrusions, and clustering scores) were evaluated as a whole, they
showed that a robust TPNL effect can be found at both the 1-min and
25-min retention intervals. Despite these promising findings, a note of
caution is in order: we manipulated retention interval at a relatively
modest scale with only two time points (i.e., 1 min vs. 25 min).
Consequently, our understanding of the persistence of TPNL will benefit
from future investigations that include more time points and longer
retention intervals.
Explaining testing-potentiated new learning
Why does testing potentiate new learning? One possibility is that a
switch in context conferred by testing isolates the original learning
episode (i.e., the list studied before retrieval practice) from the new
learning episode (i.e., the list studied after retrieval practice), similar to
the effects of taking a break from studying. Alternatively, taking a test
may help participants switch to more effective strategies for future
encoding and/or retrieval (Chan et al., 2017; Cho et al., 2017).
In the present study, we tested the context change account using the
lag manipulation and the strategy change account with list-by-list
clustering analyses. If retrieval potentiates subsequent learning because
it alters task context, the lag activities should do the same, and one
should not observe a significant TPNL effect in Experiment 3. Based on
the correct recall and the semantic clustering data, the 25-min lag had
little impact on test-potentiated new learning, as the TPNL effect re-
mained robust despite the lag. These findings suggest that the benefits
of testing on new learning go beyond simply providing a break (or a
change in context) from encoding activities. A potential concern with
this conclusion is that perhaps our lag manipulation failed to cause
context change for participants who did not take the interpolated test.
Contrary to this possibility, the intrusion data indicate that the 25-min
lag had a beneficial effect for participants in the no-testing and
restudying conditions, such that they produced fewer intrusions following
the 25-min lag than following the 1-min lag. The intrusion results thus
indicate that the lag was likely successful at inducing a context change
because it helped participants isolate Lists 1–3 from List 4. A potential
argument here is that perhaps retrieval induces a more powerful con-
text change than activities that do not involve retrieval. However, we
find this argument unconvincing due to its circularity (i.e., retrieval
enhances new learning more than a study break because retrieval
changes context more than a study break).
We believe that the present results, and the ARC data in particular,
are consistent with the idea that prior retrieval enhances new learning
because it causes participants to use superior encoding/retrieval stra-
tegies. This idea is not entirely new, as researchers have recently pro-
posed that testing may cause learners to shift to more “efficient” or
more “elaborative” encoding strategies (Cho et al., 2017; Gordon &
Thomas, 2017). However, it is not yet clear what strategies are con-
sidered more efficient. With the present materials, we interpreted our
results as follows: During retrieval practice (but not during restudying
or no-testing), participants became sensitive to the categorical structure
of the lists when they used a recalled item to cue the retrieval of other
studied items (Carpenter, 2011; Chan, McDermott, & Roediger, 2006;
Pyc & Rawson, 2010). For example, participants might recall the word
“thunder,” which might serve as a retrieval cue for “wind.” We believe
this associative cuing among retrieval candidates can happen sponta-
neously, which in turn alters participants’ encoding strategy for the up-
coming study lists in two important ways. First, prior retrieval might
bias participants’ encoding strategy toward detecting related words
within a study list, which should enhance relational processing among
these words (Hunt & McDaniel, 1993) and strengthen their retention.
Second, this bias toward processing relational elements of the words
might increase the likelihood that related words from prior lists would
be spontaneously retrieved (Hintzman, 2004), which in turn facilitates
the integration of these items across lists (Wahlheim, 2015). Together,
these mechanisms might be responsible for the test-potentiated new
learning effect, at least as it pertains to the present materials. In the
current paper, we refer to this explanation as a strategy change account,
while acknowledging that this account incorporates ideas that are not
explicitly based on changes in encoding strategy, such as recursive re-
minding and study-phase retrieval (Hintzman, 2009; Jacoby,
Wahlheim, & Kelley, 2015; Wahlheim, 2015). To be clear, we believe
that testing can induce an encoding strategy change that may cascade to
other processes that are beneficial for new learning.
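To make the clustering measure behind this argument concrete, the adjusted ratio of clustering (ARC; Roenker, Thompson, & Brown, 1971) can be sketched in a few lines of Python. The sketch assumes the standard formulation, ARC = (R − E(R)) / (max R − E(R)), where R is the number of adjacent same-category pairs in the recall output, E(R) = Σ n_i² / N − 1 is the number expected by chance, and max R = N − k for N recalled items drawn from k categories. The function name `arc_score`, the category labels, and the sample sequences are hypothetical illustrations, not data or code from the present experiments, and the scoring procedure actually used may differ in detail (e.g., in how undefined scores are treated).

```python
from collections import Counter


def arc_score(recalled_categories):
    """Adjusted ratio of clustering for one recall sequence.

    recalled_categories: category label of each correctly recalled item,
    in output order (hypothetical input format).
    Returns None when the score is undefined (max R equals E(R)),
    e.g., when all recalled items come from a single category.
    """
    n = len(recalled_categories)
    if n == 0:
        return None

    counts = Counter(recalled_categories)
    sum_sq = sum(c * c for c in counts.values())

    # R: observed category repetitions (adjacent items from the same category)
    observed = sum(
        a == b for a, b in zip(recalled_categories, recalled_categories[1:])
    )
    # max R: every category recalled as one uninterrupted run
    maximum = n - len(counts)

    # Undefined when max R == E(R); compared in integers to avoid float issues
    if sum_sq - n == maximum * n:
        return None

    # E(R): repetitions expected if output order were unrelated to category
    expected = sum_sq / n - 1
    return (observed - expected) / (maximum - expected)


clustered = ["weather", "weather", "weather", "animal", "animal", "animal"]
alternating = ["weather", "animal", "weather", "animal", "weather", "animal"]
print(arc_score(clustered), arc_score(alternating))
```

Under this formulation, a perfectly clustered output yields 1.0, strict alternation between two equally represented categories yields −1.0, and chance-level clustering corresponds to 0.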
If enhanced relational encoding of the categorized words con-
tributes to the TPNL effect, one should expect that the magnitude of this
effect would be amenable to manipulations of presentation order of the
words. Specifically, in the present experiments, we always presented
words for encoding in a random order, which obscured the categorical
structure of the list. If testing potentiates new learning because it fa-
cilitates relational encoding of the words, then its benefits should be
reduced when words belonging to the same category are presented in
blocks (e.g., consecutively). Presenting related words in blocks should
encourage relational processing, thereby minimizing the difference in
processing orientation between the tested and nontested participants.
This prediction is borne out in a study by Nunes and Weinstein (2012),
in which participants studied words from the Deese-Roediger-McDermott (DRM)
associative lists (Roediger & McDermott, 1995). In one experiment, the
words from each DRM list (e.g., hill, valley, summit) were spread across
five study lists. Participants either received retrieval practice after each
of the first four lists or not, and then all participants were tested on List
5. Similar to the present experiments, interspersed testing promoted
learning of the words in List 5. Critically, in their Experiment 2, Nunes
and Weinstein presented all words of a given DRM list together in Lists
1–4 (instead of spreading the words across lists), such that each study
list corresponded to a single DRM association (e.g., all the words in List
1 were related to mountain, all the words in List 2 were related to soft).
Consistent with the strategy change account proposed here, the TPNL
effect was absent with this blocked presentation method, presumably
because (1) the blocked presentation no longer allowed relational processing
of items across lists for the tested participants or (2) the blocked
presentation naturally invited relational processing within list for the
nontested participants. However, an alternative explanation is also
possible. Specifically, the TPNL effect might not have occurred when
each study list consisted of a different set of semantic associates because
presenting related words in a blocked fashion might have prevented the
buildup of proactive interference in the control, non-tested condition.
This logic is based on the finding that switching semantic categories
during encoding can release learners from proactive interference
(Wickens, 1970). But perhaps more importantly, how does changing
semantic categories release learners from proactive interference? One
possibility is that changing semantic categories evokes a context change
(Bauml & Kliegl, 2013). But as we have discussed extensively above, we
do not believe that a context change account serves as the best explanation
for test-potentiated new learning. Consequently, we argue
here that the strategy change account is a more viable (and testable)
explanation for both the present findings and prior findings (Nunes &
Weinstein, 2012).
One may question whether this strategy change account can explain
the TPNL effect in other situations, such as when participants study
unrelated word lists (Aslan & Bauml, 2015; Pastotter & Bauml, 2014;
Pastotter, Weber, & Bauml, 2013; Pastotter et al., 2011) or more com-
plex materials like video lectures or text passages (Chan & LaPaglia,
2011; Gordon & Thomas, 2014; Szpunar et al., 2013; Szpunar et al.,
2014; Wissman & Rawson, 2015; Wissman et al., 2011). Because un-
related words do not normally lend themselves to relational processing,
they may reduce any categorical processing advantage induced by prior
testing. Consequently, one may argue that the strategy change account
would not predict a TPNL effect with unrelated word lists, unless one
expands the idea of a strategy change toward relational processing to include
ad-hoc relations. For example, when performing retrieval practice of
unrelated words, participants may notice or generate ad-hoc associa-
tions among these words, and they can then apply this relational en-
coding strategy when studying subsequent lists. The effort required to
produce these ad-hoc relations would likely be greater than that needed
to process the pre-existing associations for semantically related words,
so the TPNL effect should be smaller with unrelated word lists than with
moderately related word lists. Although this prediction has not yet been
tested directly, a recent meta-analysis showed that, indeed, studies
that used related words tended to show a greater TPNL effect than
studies that employed unrelated words (Chan, Meissner, & Davis, 2018).
In contrast to unrelated word lists, text passages and videos are
typically written/produced in a coherent manner, which should natu-
rally invite relational processing, so any relational processing ad-
vantage induced by prior testing is likely to be modest relative to
baseline (Einstein, McDaniel, Bowers, & Stevens, 1984; Einstein,
McDaniel, Owen, & Cote, 1990; Masson & McDaniel, 1981). A version
of the strategy change account that is not tied strictly to relational
processing, however, may provide a reasonable explanation for the
TPNL effect with text passages and videos. In a broader sense, the
strategy change account specifies that performing retrieval practice
allows participants to discover the type of learning needed to ensure
satisfactory performance (or conversely, to realize the type of learning
that is inadequate to produce satisfactory performance, if participants
are performing poorly during retrieval practice), and participants can
then adjust their subsequent encoding strategy accordingly. If we take
this broader approach to strategy change, then this account can explain
the TPNL effect with prose/video materials. However, we realize that
the idea that “retrieval practice can improve later encoding strategies”
is perhaps vaguely defined. In fact, such a broad definition of strategy
change may render the account difficult to falsify. With this in mind, we
believe that the strategy change account, as we currently conceive it,
should only be applied to explain the TPNL effect with word-list
materials, for which advantageous encoding strategies can be more
precisely defined (but see Jing et al., 2016 in which interspersed testing
improved conceptual integration of materials across sections of a video
lecture). In our opinion, application of this account to prose/video
material should only be done when one clearly outlines what is con-
sidered an advantageous encoding strategy so that the hypothesis can
be adequately tested.
Concluding remarks
Effective learning often requires learners to sustain their attention
for prolonged periods of time – a difficult proposition (Smallwood,
Fishman, & Schooler, 2007; Smallwood & Schooler, 2015). Recent re-
search, however, has pointed to the possibility that inserting retrieval
practice into an encoding task can reduce inattention and potentiate
learning (Szpunar et al., 2013; for a recent review, see Szpunar, 2017).
In the present experiments, we demonstrated that testing does not en-
hance subsequent learning simply because it provides a break from the
encoding activities. Instead, performing retrieval practice changes how
learners approach new, to-be-learned information, and this benefit of
retrieval on new learning persists over a moderate retention interval.
From a theoretical perspective, these results help shed light on the
mechanisms that might be responsible for test-potentiated new
learning; from a practical perspective, the present findings add to a
growing literature on the multi-faceted benefits of retrieval practice on
student learning.
Appendix A
A screenshot of sample brain teaser questions used in Experiments 1–4.
Appendix B. Supplementary material
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jml.2018.05.007.
References
Abel, M., & Bauml, K. H. (2016). Retrieval practice can eliminate list method directed
forgetting. Memory & Cognition, 44(1), 15–23. http://dx.doi.org/10.3758/s13421-
015-0539-x.
Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A
meta-analysis of practice testing. Review of Educational Research. http://dx.doi.org/
10.3102/0034654316689306.
Allen, G., & Arbak, C. J. (1976). The priority effect in the A-B, A-C paradigm and subjects’
expectations. Journal of Verbal Learning and Verbal Behavior, 15, 381–385.
Ariga, A., & Lleras, A. (2011). Brief and rare mental “breaks” keep you focused:
Deactivation and reactivation of task goals preempt vigilance decrements. Cognition,
118(3), 439–443. http://dx.doi.org/10.1016/j.cognition.2010.12.007.
Arkes, H. R., & Lyons, D. J. (1979). A mediational explanation of the priority effect.
Journal of Verbal Learning and Verbal Behavior, 18, 721–731.
Aslan, A., & Bauml, K. H. (2015). Testing enhances subsequent learning in older but not in
younger elementary school children. Developmental Science. http://dx.doi.org/10.
1111/desc.12340 [in press].
Bauml, K. H., & Kliegl, O. (2013). The critical role of retrieval processes in release from
proactive interference. Journal of Memory and Language, 68(1), 39–53.
Bunce, D. M., Flens, E. A., & Neiles, K. Y. (2010). How long can students pay attention in
class? A study of student attention decline using clickers. Journal of Chemical
Education, 87(12), 1438–1443. http://dx.doi.org/10.1021/ed100409p.
Carpenter, S. K. (2011). Semantic information activated during retrieval contributes to
later retention: Support for the mediator effectiveness hypothesis of the testing effect.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(6),
1547–1552.
Carpenter, S. K., Pashler, H., Wixted, J. T., & Vul, E. (2008). The effects of tests on
learning and forgetting. Memory & Cognition, 36, 438–448.
Centre for Teaching Excellence – University of Waterloo. (2012, November 7). From
presenting to lecturing: adapting material for classroom delivery. Retrieved October
11, 2016, from https://uwaterloo.ca/centre-for-teaching-excellence/teaching-
resources/teaching-tips/lecturing-and-presenting/delivery/adapting-material-
classroom-delivery.
Chan, J. C. K. (2010). Long-term effects of testing on the recall of nontested materials.
Memory, 18(1), 49–57. http://dx.doi.org/10.1080/09658210903405737.
Chan, J. C. K., Meissner, C. A., & Davis, S. D. (2018). Test-potentiated (new) learning: A
meta-analytic review [in preparation].
Chan, J. C. K., & LaPaglia, J. A. (2011). The dark side of testing memory: repeated re-
trieval can enhance eyewitness suggestibility. Journal of Experimental Psychology:
Applied, 17(4), 418–432. http://dx.doi.org/10.1037/a0025147.
Chan, J. C. K., Manley, K. D., & Lang, K. (2017). Retrieval-enhanced suggestibility: A
retrospective and a new investigation. Journal of Applied Research in Memory and
Cognition, 6, 213–229. http://dx.doi.org/10.1016/j.jarmac.2017.07.003.
Chan, J. C. K., & McDermott, K. B. (2007). The testing effect in recognition memory: a
dual process account. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 33(2), 431–437. http://dx.doi.org/10.1037/0278-7393.33.2.431.
Chan, J. C. K., McDermott, K. B., & Roediger, H. L. (2006). Retrieval-induced facilitation:
initially nontested material can benefit from prior testing of related material. Journal
of Experimental Psychology: General, 135(4), 553–571. http://dx.doi.org/10.1037/
0096-3445.135.4.553.
Chan, J. C. K., Thomas, A. K., & Bulevich, J. B. (2009). Recalling a witnessed event in-
creases eyewitness suggestibility: The reversed testing effect. Psychological Science,
20(1), 66–73.
Cho, K. W., Neely, J. H., Crocco, S., & Vitrano, D. (2017). Testing enhances both encoding
and retrieval for both tested and untested items. The Quarterly Journal of Experimental
Psychology, 70, 1211–1235. http://dx.doi.org/10.1080/17470218.2016.1175485.
Davis, S. D., & Chan, J. C. K. (2015). Studying on borrowed time: How does testing impair
new learning? Journal of Experimental Psychology: Learning, Memory, and Cognition,
41(6), 1741–1754. http://dx.doi.org/10.1037/a0032377.
Davis, S. D., Chan, J. C. K., & Wilford, M. M. (2017). The dark side of interpolated testing:
Frequent switching between retrieval and encoding impairs new learning. Journal of
Applied Research in Memory and Cognition. http://dx.doi.org/10.1016/j.jarmac.2017.
07.002 [in press].
Einstein, G. O., McDaniel, M. A., Bowers, C. A., & Stevens, D. T. (1984). Memory for
prose: The influence of relational and proposition-specific processing. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 10(1), 133–143.
Einstein, G. O., McDaniel, M. A., Owen, P. D., & Cote, N. C. (1990). Encoding and recall of
texts: The importance of material appropriate processing. Journal of Memory and
Language, 29, 566–581.
Finn, B., & Roediger, H. L. (2013). Interfering effects of retrieval in learning new in-
formation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(6),
1665–1681. http://dx.doi.org/10.1037/a0032377.
Gordon, L. T., & Thomas, A. K. (2014). Testing potentiates new learning in the mis-
information paradigm. Memory & Cognition, 42(2), 186–197. http://dx.doi.org/10.
3758/s13421-013-0361-2.
Gordon, L. T., & Thomas, A. K. (2017). The forward effects of testing on eyewitness
memory: The tension between suggestibility and learning. Journal of Memory and
Language, 95, 190–199. http://dx.doi.org/10.1016/j.jml.2017.04.004.
Gordon, L. T., Thomas, A. K., & Bulevich, J. B. (2015). Looking for answers in all the
wrong places: How testing facilitates learning of misinformation. Journal of Memory
and Language, 83, 140–151.
Gunter, B. (1980). Release from proactive interference with television news items:
Evidence for encoding dimensions within televised news. Journal of Experimental
Psychology: Human Learning and Memory, 6(2), 216–223. http://dx.doi.org/10.1037/
0278-7393.6.2.216.
Hintzman, D. L. (2004). Judgment of frequency versus recognition confidence: Repetition
and recursive reminding. Memory & Cognition, 32, 336–350.
Hintzman, D. L. (2009). How does repetition affect memory? Evidence from judgments of
recency. Memory & Cognition, 38(1), 102–115. http://dx.doi.org/10.3758/MC.38.1.
102.
Hunt, R. R., & McDaniel, M. A. (1993). The enigma of organization and distinctiveness.
Journal of Memory and Language, 32, 421–445.
Jacoby, L. L., Wahlheim, C. N., & Kelley, C. M. (2015). Memory consequences of looking
back to notice change: Retroactive and proactive facilitation. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 41(5), 1282–1297. http://dx.doi.org/10.
1037/xlm0000123.
Jang, Y., & Huber, D. E. (2008). Context retrieval and context change in free recall:
Recalling from long-term memory drives list isolation. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 34, 112–127.
Jing, H. G., Szpunar, K. K., & Schacter, D. L. (2016). Interpolated testing influences fo-
cused attention and improves integration of information during a video-recorded
lecture. Journal of Experimental Psychology: Applied, 22(3), 305–318. http://dx.doi.
org/10.1037/xap0000087.
Jonker, T. R., Seli, P., & MacLeod, C. M. (2013). Putting retrieval-induced forgetting in
context: An inhibition-free, context-based account. Psychological Review, 120(4),
852–872. http://dx.doi.org/10.1037/a0034246.
Klein, K. A., Shiffrin, R. M., & Criss, A. H. (2011). Putting context in context. In The
foundations of remembering: essays in honor of Henry L. Roediger, III (pp. 171–190).
New York: Routledge. http://dx.doi.org/10.4324/9780203837672.
Masson, M. E., & McDaniel, M. A. (1981). The role of organizational processes in long-
term retention. Journal of Experimental Psychology: Human Learning and Memory, 7(2),
100.
McDaniel, M. A., & Einstein, G. O. (1989). Material-appropriate processing: A con-
textualist approach to reading and studying strategies. Educational Psychology Review,
1, 113–145.
McDaniel, M. A., Einstein, G. O., & Waddill, P. J. (1990). Material-appropriate processing:
Implications for remediating recall deficits in students with learning disabilities.
Learning Disability Quarterly, 13, 258–268.
McDaniel, M. A., Roediger, H. L., & McDermott, K. B. (2007). Generalizing test-enhanced
learning from the laboratory to the classroom. Psychonomic Bulletin & Review, 14,
200–206.
Nunes, L. D., & Weinstein, Y. (2012). Testing improves true recall and protects against the
build-up of proactive interference without increasing false recall. Memory, 20(2),
138–154. http://dx.doi.org/10.1080/09658211.2011.648198.
Olmsted, J. A. (1999). The mid-lecture break: When less is more. Journal of Chemical
Education, 76(2–4), 525–527.
Pastotter, B., & Bauml, K. H. (2014). Retrieval practice enhances new learning: the for-
ward effect of testing. Frontiers in Psychology, 5, 1–5.
Pastotter, B., Schicker, S., Niedernhuber, J., & Bauml, K. H. (2011). Retrieval during
learning facilitates subsequent memory encoding. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 37, 287–297.
Pastotter, B., Weber, J., & Bauml, K. H. (2013). Using testing to improve learning after
severe traumatic brain injury. Neuropsychology, 27(2), 280–285. http://dx.doi.org/
10.1037/a0031797.
Pierce, B. H., Gallo, D. A., & McCain, J. L. (2017). Reduced interference from memory
testing: a postretrieval monitoring account. Journal of Experimental Psychology:
Learning, Memory, and Cognition. http://dx.doi.org/10.1037/xlm0000377 [in
press].
Pyc, M. A., & Rawson, K. A. (2010). Why testing improves memory: Mediator
effectiveness hypothesis. Science, 330(6002), 335. http://dx.doi.org/10.1126/
science.1191465.
Risko, E. F., Anderson, N., Sarwal, A., Engelhardt, M., & Kingstone, A. (2012). Everyday
attention: variation in mind wandering and memory in a lecture. Applied Cognitive
Psychology, 26(2), 234–242. http://dx.doi.org/10.1002/acp.1814.
Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests
improves long-term retention. Psychological Science, 17, 249–255.
Roediger, H. L., & McDermott, K. B. (1995). Creating false memories: Remembering
words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 21(4), 803–814.
Roenker, D. L., Thompson, C. P., & Brown, S. C. (1971). Comparison of measures for the
estimation of clustering in free recall. Psychological Bulletin, 76, 45–48.
Rowland, C. A. (2014). The effect of testing versus restudy on retention: a meta-analytic
review of the testing effect. Psychological Bulletin, 140(6), 1432–1463. http://dx.doi.
org/10.1037/a0037559.
Sahakyan, L., & Hendricks, H. E. (2012). Context change and retrieval difficulty in the
list-before-last paradigm. Memory & Cognition, 40(6), 844–860. http://dx.doi.org/10.
3758/s13421-012-0198-0.
Sahakyan, L., & Kelley, C. M. (2002). A contextual change account of the directed for-
getting effect. Journal of Experimental Psychology: Learning, Memory, and Cognition,
28(6), 1064–1072. http://dx.doi.org/10.1037/0278-7393.28.6.1064.
Smallwood, J., Fishman, D. J., & Schooler, J. W. (2007). Counting the cost of an absent
mind: Mind wandering as an underrecognized influence on educational performance.
Psychonomic Bulletin and Review, 14, 230–236. http://dx.doi.org/10.3758/
BF03194057.
Smallwood, J., & Schooler, J. W. (2015). The science of mind wandering: Empirically
navigating the stream of consciousness. Annual Review of Psychology, 66(1), 487–518.
http://dx.doi.org/10.1146/annurev-psych-010814-015331.
Soderstrom, N. C., & Bjork, R. A. (2014). Testing facilitates the regulation of subsequent
study time. Journal of Memory and Language, 73, 99–115. http://dx.doi.org/10.1016/
j.jml.2014.03.003.
Szpunar, K. (2017). Directing the wandering mind. Current Directions in Psychological
Science, 26(1), 40–44. http://dx.doi.org/10.1177/0963721416670320.
Szpunar, K. K., Jing, H. G., & Schacter, D. L. (2014). Overcoming overconfidence in
learning from video-recorded lectures: Implications of interpolated testing for online
education. Journal of Applied Research in Memory and Cognition, 3(3), 161–164.
http://dx.doi.org/10.1016/j.jarmac.2014.02.001.
Szpunar, K. K., Khan, N. Y., & Schacter, D. L. (2013). Interpolated memory tests reduce
mind wandering and improve learning of online lectures. Proceedings of the National
Academy of Sciences of the United States of America, 110, 6313–6317. http://dx.doi.
org/10.1073/pnas.1221764110.
Szpunar, K. K., McDermott, K. B., & Roediger, H. L. (2008). Testing during study insulates
against the buildup of proactive interference. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 34(6), 1392–1399. http://dx.doi.org/10.1037/
a0013082.
Tulving, E., & Watkins, M. J. (1974). On negative transfer: Effects of testing one list on the
recall of another. Journal of Verbal Learning and Verbal Behavior, 13, 181–193.
Van Overschelde, J. P., Rawson, K. A., & Dunlosky, J. (2004). Category norms: An up-
dated and expanded version of the Battig and Montague (1969) norms. Journal of
Memory and Language, 50(3), 289–335. http://dx.doi.org/10.1016/j.jml.2003.10.
003.
Wahlheim, C. N. (2015). Testing can counteract proactive interference by integrating
competing information. Memory & Cognition, 43(1), 27–38. http://dx.doi.org/10.
3758/s13421-014-0455-5.
Weinstein, Y., Gilmore, A. W., Szpunar, K. K., & McDermott, K. B. (2014). The role of test
expectancy in the build-up of proactive interference in long-term memory. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 40(4), 1039–1048. http://
dx.doi.org/10.1037/a0036164.
Weinstein, Y., McDermott, K. B., & Szpunar, K. K. (2011). Testing protects against
proactive interference in face–name learning. Psychonomic Bulletin & Review, 18(3),
518–523. http://dx.doi.org/10.3758/s13423-011-0085-x.
Whiffen, J. W., & Karpicke, J. D. (2017). The role of episodic context in retrieval practice
effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43,
1036–1046. http://dx.doi.org/10.1037/xlm0000379.
Wickens, D. D. (1970). Encoding categories of words: An empirical approach to meaning.
Psychological Review, 77(1), 1–15. http://dx.doi.org/10.1037/h0028569.
Wissman, K. T., & Rawson, K. A. (2015). Grain size of recall practice for lengthy text
material: fragile and mysterious effects on memory. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 41(2), 439–455. http://dx.doi.org/10.
1037/xlm0000047.
Wissman, K. T., Rawson, K. A., & Pyc, M. A. (2011). The interim test effect: Testing prior
material can facilitate the learning of new material. Psychonomic Bulletin & Review,
18(6), 1140–1147. http://dx.doi.org/10.3758/s13423-011-0140-7.
Yang, C., Potts, R., & Shanks, D. R. (2017). The forward testing effect on self-regulated
study time allocation and metamemory monitoring. Journal of Experimental
Psychology: Applied, 1–17. http://dx.doi.org/10.1037/xap0000122.
Yang, C., Potts, R., & Shanks, D. R. (2018). Enhancing learning and retrieval of new
information: A review of the forward testing effect. Npj Science of Learning, 3(1), 8.
Zaromb, F. M., & Roediger, H. L. (2010). The testing effect in free recall is associated with
enhanced organizational processes. Memory & Cognition, 38(8), 995–1008. http://dx.
doi.org/10.3758/MC.38.8.995.