The Assignment and Articles from this week's lesson are attached. You will keep the same topic, the company Netflix, as discussed in an earlier assignment last week. Please follow the Assignment instructions that are attached.
Assignment:
The Final Project is on Netflix, and you will incorporate the Netflix company into this week's learning. I will attach a couple of articles on topics covered this week, and you will answer the questions below about this Assignment. The Assignment should be 1.5–3 pages.
***This week you will continue to work on your Final Project. You will submit a business memo, written to your Instructor, that explains how you plan to incorporate your learning from the week into your Final Project. This will not be a “perfect” synopsis at this point, but it should capture the main themes and important ideas from the week. Your memo should include the following:
· Your preliminary summary of how you are planning to incorporate this week’s learning into your Final Project
· Your ideas and recommendations for how your organization can mitigate the short-term and long-term issues that can arise when selecting and using data resources and systems
· Brief descriptions of the types of data resources, data processing, and storage systems chosen
· An explanation of how the organization might manage the potential implications of those selections, while taking advantage of the opportunities they afford to sustain the business or gain a competitive advantage
· Other relevant recommendations or issues that you identified, with a brief analysis of why they are important
Note:
If you are unable to find relevant information, you may want to look for similar information at/for other similar publicly traded companies. You may find relevant information that will enable you to make appropriate inferences about your organization and make reasonable assumptions so you can proceed with your project.
If you have questions about how to apply what you are learning or how to find the most relevant information for your organization’s needs, please discuss your choice with your Instructor using the Contact the Instructor link in the classroom.
General Guidance on Assignment Length: Your Week 3 Assignment will typically be 1.5–3 pages (0.75–1 page(s) if single spaced), excluding a title page (not required for this Assignment) and references.
As a best practice for this course, as you engage and complete your weekly activities, you will capture notes on your learning for each week and add them to your Final Project Portfolio, which you will use to formulate your Final Project. By the time you turn in your Final Project in Week 7, your working Final Project Portfolio document will contain your notes on the resources you found relevant in each week, as well as your ideas for how your organization can improve its business information systems management with respect to each week’s main themes. You should also include your analyses of the implications of your recommendations, as well as reasons why your recommendations are important. You will update your Final Project Portfolio each week.
This week you are encouraged to begin setting up your presentation in PowerPoint. You should complete a title slide and 1–2 slides with your preliminary recommendations for your organization as they relate to this week’s learning.
Note:
Neither your Final Project Portfolio nor your preliminary slides will be submitted for grading this week.
Article
Management of a Large Qualitative Data Set: Establishing Trustworthiness of the Data
Debbie Elizabeth White, RN, PhD
Associate Professor, Associate Dean of Research
Faculty of Nursing
University of Calgary
Calgary, Alberta, Canada
Nelly D. Oelke, RN, PhD
Assistant Professor
School of Nursing
Faculty of Health and Social Development
University of British Columbia, Okanagan Campus
Kelowna, British Columbia, Canada
Steven Friesen, BSc
Quality Practice Leader
Bethany Care Society
Calgary, Alberta, Canada
© 2012 White, Oelke, and Friesen.
Abstract
Health services research is multifaceted and impacted by the multiple contexts and
stakeholders involved. Hence, large data sets are necessary to fully understand the complex
phenomena (e.g., scope of nursing practice) being studied. The management of these large
data sets can lead to numerous challenges in establishing trustworthiness of the study. This
article reports on strategies utilized in data collection and analysis of a large qualitative study
to establish trustworthiness. Specific strategies undertaken by the research team included
training of interviewers and coders, variation in participant recruitment, consistency in data
collection, completion of data cleaning, development of a conceptual framework for analysis,
consistency in coding through regular communication and meetings between coders and key
research team members, use of N6™ software to organize data, and creation of a
comprehensive audit trail with internal and external audits. Finally, we make eight
recommendations that will help ensure rigour for studies with large qualitative data sets:
organization of the study by a single person; thorough documentation of the data collection
and analysis process; attention to timelines; the use of an iterative process for data collection
and analysis; internal and external audits; regular communication among the research team;
adequate resources for timely completion; and time for reflection and diversion. Following
these steps will enable researchers to complete a rigorous, qualitative research study when
faced with large data sets to answer complex health services research questions.
Keywords: qualitative research, data management, large data sets, rigour, nursing, scope of
practice
Acknowledgements: We want to thank all the participants for their valuable contributions to
this study. In addition, we would like to thank our funders: Canadian Health Services
Research Foundation, Calgary Health Region (now Alberta Health Services, Calgary),
Capital Health Region (now Alberta Health Services, Edmonton), Saskatoon Health Region,
and the University of Calgary, Faculty of Nursing.
International Journal of Qualitative Methods 2012, 11(3)
Introduction
Healthcare systems are complex, multi-component systems that are continually evolving.
Given the complexity of these systems, health services research is multifaceted and impacted by
the multiple contexts and stakeholders involved. Even when researchers study only a particular
component of the healthcare system (e.g., scope of nursing practice in acute care), multiple
contexts are encountered and many participants are included to better understand the complex
phenomena being studied. Health services research on scope of practice is not well established
and, given the research questions, qualitative research is often the focus of such exploratory
research. Because data collection may occur across a number of sites by more than one research
assistant, research teams encounter logistical issues and have difficulties maintaining
predetermined timelines. The end result can be a very large amount of qualitative data that must
be analyzed and interpreted appropriately to ensure that an accurate synopsis of the results will be
presented. Organization of data and attention to rigour are essential when working with such large
qualitative data sets. This article describes the management of a large qualitative data set
generated from the research study entitled “A Systematic Approach to Maximizing Nurses’ Scope
of Practice.” More specifically, the purpose of this article is to reflect upon and describe the
processes through which the research team managed a large qualitative data set to ensure that the
final product would be judged as rigorous. One of the authors, Nelly D. Oelke, has considerable
experience in qualitative research. This expertise and the application of the literature on
qualitative data analysis guided the structures and processes used to collect and analyze our data.
This article contributes to the literature about managing large qualitative data sets by providing
concrete steps for ensuring rigour in data collection, analysis, and interpretation.
Background
Trustworthiness and data management are vital to the success of qualitative studies. Although
literature on maintaining rigour in qualitative research is abundant, few articles have tackled
doing so with large qualitative data sets (Knafl & Ayres, 1996) and few researchers have
documented their process. A search of the literature was conducted and confirmed these findings.
Search terms used for our literature search included qualitative research, data management, large
data sets, and rigor, with coverage of the following databases: MEDLINE, CINAHL, and
PsycINFO. Guba (1981) and others (Johnson & Waterfield, 2004; Whittemore, Chase, & Mandle,
2001) recommend general methodologies to ensure rigour in qualitative research. Although the
descriptions of methodologies in the literature vary, most involve steps to maintain credibility,
dependability, transferability, and confirmability (Guba, 1981).
To maintain credibility (Guba, 1981) or authenticity (Whittemore et al., 2001), researchers must
adhere to methods accepted as scientifically sound in the qualitative and informational sciences.
While transparency of methodology is important, Sandelowski (1997) cautions against focusing
only on methods. Rather, researchers should maximize data utility to answer the research
questions. The researcher must have a satisfactory cultural familiarity with the participating
institution and use a comfortable approach in recruiting participants so that the sampling process
is random and unbiased (Guba, 1981). Moreover, participants’ input must be honest, clearly
recorded, and accurately presented (Whittemore et al., 2001).
Dependability and transferability are related in that both ensure all research design and operations
are clearly identified (Guba, 1981). These steps also allow for replication of the methodology
with a larger population or by future researchers. However, it is important to differentiate
between the dependability of a method in producing similar interpretations and the reliability of a
method in producing identical results. Qualitative research focuses on describing participants’
experience as accurately as possible (Sandelowski, 1997), rather than using numbers to describe
the phenomena of interest. According to Sandelowski (1997), interpreting the results, providing
valid applications of the findings, and accumulating knowledge as a foundation for other studies
are essential for validating data from a qualitative study. Johnson and Waterfield (2004) explain:
Qualitative data are descriptive, unique to a particular context and therefore cannot be
reproduced time and again to demonstrate ‘reliability’ (Bloor, 1997). Instead of trying to
control extraneous variables, qualitative research takes the view that reality is socially
constructed by each individual and should be interpreted rather than measured; that
understanding cannot be separated from context. (pp. 122–123)
Whittemore et al.’s (2001) framework for enhancing rigour includes criticality and integrity
components, which for Guba (1981) and Johnson and Waterfield (2004), are included as
components of confirmability or an audit trail. It is recommended that researchers keep an
accurate, comprehensive record of the approaches and activities employed in the study, both in
data collection and analysis. This record includes highlighting shortcomings of the study in the
research report and providing transparent links between study results and actual experiences of
the participants in the study (Guba, 1981). Such audit trails not only provide a solid
methodological reference for the reader, but also provide an opportunity for reflective reasoning
(on themes or categories chosen, interpretations, etc.) and criticism for the researchers as the
study progresses (Guba, 1981; Johnson & Waterfield, 2004; Whittemore et al., 2001). For
example, if methodology changes at some point in the study, an audit trail would keep a record of
when, why, and what changes were implemented. Such audit trails become especially useful in
the management of large databases and for placing data points, methodology, and interpretation within the particular context in which they belong.
Knafl and Ayres (1996) offer researchers two data management steps for handling larger
qualitative data sets. First, case summaries can save researchers considerable time and logistical
resources while decreasing error. Core study researchers would summarize focus group or
interview transcripts to a fraction of their original length, and also include relevant data organized
into themes agreed upon beforehand as part of a summary guideline. This step not only allows
core researchers, who will be interpreting the data, to work closely with the data, but also allows
for critical insight. As a complement to the case summaries, it is recommended that researchers
tackling a large data set create matrices using database management systems. Guided by themes
and questions identified in the study, matrices provide a visual display of the data, including
extracted themes. Such matrices simplify the data for researchers’ discussions, while the case
summaries provide more details on the data. Moreover, data can be reorganized quickly using
electronic matrices, allowing for various perspectives and discussions on study outcomes (Knafl
& Ayres, 1996).
Ensuring rigour in qualitative research is a priority when collecting, presenting, and interpreting
data. Larger qualitative data sets can present a critical challenge for researchers in maintaining
study trustworthiness and, therefore, special guidelines must be strictly followed to ensure
transparency, logical reasoning, and criticality. As few sources in the literature have suggested
methodology for managing large qualitative data sets, this article aims to outline the methods
followed by our research team to maintain rigour in such circumstances.
Description of the Study
Numerous reports have highlighted the need to address the under-utilization of health human
resources by maximizing professional scopes of practice (Advisory Committee on Health Human
Resources, 2002; Fyke, 2001). The need to clarify and define the nursing scope of practice was
recognized in Canada as well as internationally. However, there was a void in the research
literature in terms of describing scope of practice (being able to practice to the full extent of one’s
education, knowledge, and experience) and examining barriers to the enactment of full scope of
practice. This research study was unique in that it examined scope and boundaries in the practice
of various categories of nursing personnel simultaneously, for example, registered nurses (RNs),
registered psychiatric nurses (RPNs), and licensed practical nurses (LPNs). The overall goal of
this research was to make rich and robust conclusions about the scope of practice of nurses, the
barriers to and facilitators of scope, and the impact of contextual factors on scope of practice.
Research findings have been reported elsewhere (Oelke, White, Besner, Doran, McGillis-Hall, &
Giovannetti, 2008; White, Oelke, Besner, Doran, McGillis-Hall, & Giovannetti, 2008).
Study Methodology
This research study used a descriptive exploratory design with mixed methods (Creswell, 2009)
to explain enactment of scope of practice among all categories of regulated nurses (e.g., RNs,
LPNs, RPNs). Both quantitative (e.g., questionnaires) and qualitative (e.g., interviews with a
variety of stakeholders) data were collected, and one informed the other in the data analysis and
interpretation. This article focuses on the qualitative data set. To make our experience of
managing this large qualitative data set truly transparent, underlying foundational components of
the study will be discussed. According to Sandelowski’s (2000) classifications, our study was a
qualitative descriptive study. The methodological underpinnings of this study were eclectic. The
qualitative component drew on tenets (e.g., importance of the setting and context, purposive
sampling, and inductive analysis) situated within the naturalistic paradigm espoused by Lincoln
and Guba (1985) and Miles and Huberman (1994). A subcomponent of the study (quantitative
data from surveys and regional corporate databases) was positioned within a positivist paradigm.
This latter component will not be addressed in this article.
Research questions focused on nurses’ and other healthcare providers’ perceptions of nurses
working to their full scope of practice. Participants were also asked to identify personal,
professional, and organizational barriers or facilitators that enabled or hindered their ability to
work to their full scope of practice. These types of questions are typically associated with
qualitative descriptive studies (Sandelowski, 2000). Data were collected on 14 acute care nursing
units located within three western Canadian Health Regions. To ensure variability in sampling,
patient care units from hospitals of various intensities and representing variability in provider and
organizational characteristics were selected across the health regions to participate in the study.
Types of units included in the study were intensive care, medicine, surgery, and psychiatry.
Individual, face-to-face, semi-structured interviews were conducted to gather information on
enactment of nursing roles and perceived facilitators and barriers to maximizing scope of
practice. A purposive volunteer sample of nursing personnel (e.g., RNs, LPNs, RPNs, and Patient
Care Managers) and inter-professional healthcare team members were recruited on the study
units. Patient interviews were also conducted with a small sample of volunteer patients from each
health region to validate the extent to which patient experience reflected the expected focus of
nursing defined in scope of practice documents. A total of 236 interviews were audio-recorded
and transcribed. There were 167 interviews conducted with nursing staff: 85 RNs, 31 LPNs, 11
RPNs, 19 patient care managers and assistant patient care managers, and 21 nurses in specialized
roles (e.g., nurse educators and nurse clinicians). The remainder of the interviews were completed
with other healthcare providers (e.g., physicians, social workers, and physiotherapists) and
patients. The establishment of rigour in this study became a daunting task when a large number of
nurses and other healthcare providers were interviewed to discuss enactment of scope of practice
of nurses and the influence of the work environment and other structures and processes on role
enactment.
Ensuring Trustworthiness of the Data
Our study presented a myriad of challenges to ensure trustworthiness of the data. These
challenges included collecting data from multiple sites using different research assistants,
developing a process for analyzing the data, recruiting and retaining qualified individuals to
complete coding and initial analysis, and completing in-depth analysis of the data. Despite the
many challenges encountered while managing this large qualitative data set, we succeeded in
reporting research results that met the criteria for data trustworthiness. The following sections of
this article will outline the challenges and how they were addressed in handling the large amount
of data collected in this study.
Credibility or Authenticity
First, participant recruitment was an important aspect to ensure credibility of research results.
Participant selection required the incorporation of multiple perspectives (e.g., RNs, LPNs,
managers, and interdisciplinary team members) to provide a clear and broad understanding of
nurses’ and other healthcare providers’ perceptions of nursing scope of practice. Although
participants volunteered to be interviewed, researchers facilitated a diverse range of perspectives
by presenting, in both posters and unit presentations, the importance of capturing varied perceptions
of scope of practice. Maximum variation (Polit & Hungler, 1999) was sought when recruiting
units for the study. Variability in organizational characteristics, hospital intensity, and a variety of
patient care units was desired (Aita & McIlvain, 1999; Morse & Richards, 2002). Once hospital
units were selected, participants were recruited for interviews with attention to maximum
variability in representing the various perspectives of nursing scope of practice. Given the
geographical nature of the study, as well as the nature of the clinical environment, it was not
possible to limit interviews to one unit with one group of providers at a single point in time.
Rather, interviews were completed on various patient care units with a variety of providers in the
same time frame, depending on their work schedules. Initially, both research assistants and
several research team members were concerned that this process would result in more interviews
being conducted because of the uncertainty about data saturation. However, bi-weekly
discussions with interviewers confirmed that while they were hearing similar elements, data
saturation had not been reached. The variability in units, desired variability in participants, and
lack of data saturation until later in the study all led to the recruitment of a very large sample of
participants for the study, which created a situation requiring the management of a large
qualitative data set.
Second, the consistency of data collection was important to ensure credibility of research results.
Data collection was conducted across three health regions with multiple research assistants. Semi-
structured interview questions focused on components of the overarching research questions and
guided the interview process with nurses and interdisciplinary team members (Sandelowski,
2000). These interviews were conducted by three different research assistants. To standardize
interview tracking and scheduling and the entry of demographic data, an Access™ database, along with a user manual detailing the process and technical applications, was provided to each
study site. To further establish consistency in data collection (Aita & McIlvain, 1999; Morse &
Richards, 2002), a two-day training session was conducted by the project manager with the
research assistants to discuss the interview protocol, review interview questions and cues, address
the concept of data saturation, and discuss the purpose of writing brief field notes following
interviews. A second component of the training session was the completion and review of several
interviews with each of the research assistants. Feedback about interviewing techniques (e.g.,
paraphrasing, clarity, utilization of cues for questions, and getting the interviewee to elaborate on
their responses) was provided to each of the research assistants. At this time, research assistants
also shared their experiences in completing the interview and provided valuable feedback
contributing to the clarity of the interview questions and additional prompts for the questions. A
training manual was also provided as a resource for the interviewers to complement training
sessions. Given the magnitude of the study, research assistants were reminded to limit the number
of interviews (3-4) completed in one day to avoid interviewer burden. Despite the focus on
consistency of data collection, opportunity was provided through the field notes to co-author
results as noted in Kvale (1996).
Finally, presentations were made to participants to ensure the credibility and authenticity of the
research results. Ten to fifteen presentations were made to various groups of participants (e.g.,
nurses, allied healthcare professionals, Patient Care Managers, senior health leaders, and Chief
Nursing Officers) wherein results were validated by participants.
Dependability and Transferability
Data Processing and Cleaning
Consistency between transcribers was ensured by utilizing a consistent template, which permitted
easy transfer of documents into N6™, a computer program designed for qualitative data storage, indexing, and theorizing. Transcription guidelines to standardize expressions and formatting were
provided to each transcriptionist. To ensure the accuracy of the transcription, each of the research
assistants reviewed the transcripts while comparing them to audio files. Minor discrepancies, such
as spelling errors and clarification of acronyms, were made in the transcripts following review by
the research assistants.
Data Analysis
Preparation for the data analysis component of the study was intense, time consuming, and multi-
focused. A phased approach (Gaskell & Bauer, 2000) was used in data analysis, with the
completion of coding and initial analysis prior to in-depth analysis of the data. A conceptual
framework was developed by the research team to begin a content analysis process. As an initial
step in the development of the framework, two research team members and one research assistant
independently reviewed four interviews to identify preliminary themes. The second step in the
development of the framework was to have two of the original research team members and two
new research team members independently analyze four new interviews utilizing the existing
framework. With the analysis of this second set of interviews, consistencies were found in the
themes identified between the team members. However, new themes also emerged resulting in an
expansion of the conceptual framework. Miller and Crabtree (1992) have described this approach
to content analysis as a “template” style (p. 18). The conceptual framework served as the initial
tree node structure to begin coding interviews in the N6™ software program.
As consistent with qualitative methods, the categories and nodes identified were not considered
static. Several iterations of this conceptual framework and tree node structure evolved,
particularly in the early stages of coding and analysis by the coders and the research team
(Gaskell & Bauer, 2000). In developing the evolving tree, particular attention was paid to the
semantic relationships of the parent and child nodes. A reference document defining each node
and indicating placement in the hierarchy of the tree structure was developed and modified to
reflect coding team discussions. This process assisted the dependability of the analysis of the
large data set, which occurred across multiple coders. Documentation of the changes and the
rationale for changes were maintained to establish an audit trail (Lincoln & Guba, 1985).
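One lightweight way to picture the evolving tree node structure described above is as a nested hierarchy in which each node carries a definition, mirroring the reference document the team maintained. The sketch below is hypothetical: the node names are invented for illustration, and nothing is assumed about how the N6™ software represents its trees internally.

```python
# Hypothetical sketch of a coding tree: each node holds a definition and
# child nodes, mirroring a reference document that fixes each node's
# placement in the hierarchy. Node names are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class CodeNode:
    name: str
    definition: str
    children: list["CodeNode"] = field(default_factory=list)

    def add_child(self, child: "CodeNode") -> None:
        self.children.append(child)

    def path_of(self, name, prefix=()):
        """Return the parent-to-child path to a named node, or None."""
        here = prefix + (self.name,)
        if self.name == name:
            return here
        for child in self.children:
            found = child.path_of(name, here)
            if found:
                return found
        return None


root = CodeNode("scope of practice", "Top-level parent node")
barriers = CodeNode("barriers", "Factors hindering enactment of full scope")
root.add_child(barriers)
barriers.add_child(CodeNode("lack of time", "Time pressure limiting role enactment"))

print(root.path_of("lack of time"))
```

Recording a definition with every node, as the reference document did, makes the semantic relationship between parent and child nodes explicit whenever the tree is revised.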
Recruiting and retaining qualified individuals to do coding and initial analysis was both an
important and difficult task (Richards, 2005). Coders were recruited through the university and
connections with other qualitative researchers. Coding was accomplished through collaboration
and strengthened by the varying perspectives of multidisciplinary team members with
backgrounds in clinical and academic nursing, psychology, social work, occupational therapy,
and health services research. Most team members had previous qualitative research experience. A
half day of training was provided to all coders with ongoing consultation and assistance provided
by various members of the research team. Binders describing the technical aspects of N6™ were
developed for each coder. Initially the coders completed the coding of the same two interviews.
Coding was compared and the coding tree was discussed. Early in the coding process weekly
meetings were held with the coders; as the study progressed meetings were decreased to every
two weeks. These meetings provided an excellent opportunity both to discuss the development of
new themes and to question and confirm saturation of themes.
In-depth analysis was completed by two experienced qualitative research team members with
different healthcare backgrounds (nursing and occupational therapy). Analysis was completed by
provider group; each researcher examined the data for specific groups of providers (e.g., RNs,
LPNs, etc.). While each of the researchers examined descriptions of nursing scope of practice and
barriers and facilitators to enactment of scope of practice, patterns across the data were also
examined (Richards, 2005). Assigning data sets to different researchers (Gaskell & Bauer, 2000)
was seen as an appropriate and manageable approach to in-depth data analysis of this large data
set. Data overload, fatigue, and the potential for the researcher to “get lost in the data” posed real
challenges for the analysis of the large amount of data collected for this study. Dedicating two
researchers, who met and discussed regularly, to the process of in-depth analysis assisted in
managing these challenges. Researchers frequently documented their processes and
interpretations as memos directly in the software program. Summary analysis documents were
also created for each of the data sets analyzed. Meeting regularly was also important in making
meaningful sense of all the data in this study.
Although N6™ was an excellent program for organizing the qualitative data, challenges were
encountered in merging projects between coders (initial analysis) and again between researchers
once the in-depth analysis was completed. QSR Merge™ is designed to merge one project with another project, but with the coders and researchers each developing separate N6™ projects (five
different projects in total), merging was not seamless. Duplication of transcripts and codes
required that one individual be associated with one transcript to prevent duplication in the final
N6™ for the complete analysis of all the data for the study. Although challenges were
encountered in merging the projects, we were successful in creating one complete N6™ project.
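The duplication problem the team encountered when merging projects can be pictured in miniature: if two coders' projects both contain the same transcript, a naive merge double-counts its coded passages. The sketch below is hypothetical (the coder labels, transcript IDs, and data layout are invented, and it does not describe how QSR Merge™ actually works); it simply enforces the rule the team adopted, that one individual be associated with one transcript.

```python
# Hypothetical sketch: merging per-coder projects keyed by transcript ID,
# rejecting duplicates so each transcript has exactly one owner.
def merge_projects(projects):
    merged = {}
    for coder, transcripts in projects.items():
        for transcript_id, codes in transcripts.items():
            if transcript_id in merged:
                raise ValueError(
                    f"Transcript {transcript_id} appears in more than one project"
                )
            merged[transcript_id] = codes
    return merged


# Invented example data: two coders, three transcripts, no overlap.
projects = {
    "coder_002": {"2055": ["lack of time"], "4414": ["role overlap"]},
    "coder_003": {"2029": ["role clarity"]},
}
merged = merge_projects(projects)
print(sorted(merged))
```

Failing loudly on a duplicate, rather than silently overwriting, is what surfaces the kind of overlap that otherwise only appears as inflated counts in the final combined project.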
Confirmability
Confirmability (Guba, 1981; Johnson & Waterfield, 2004) of research results was ensured via
four key processes: the creation of an audit trail; an internal audit; an external audit; and the
writing of the final research report.
Audit Trail
A detailed, comprehensive accounting of all data collection and data analysis activities was
completed. Changes were documented as they were made, along with rationale for the change.
Accurate and comprehensive records of the methods employed in data collection and analysis by
researchers in the study were recommended by qualitative research experts (Lincoln & Guba,
1985; Sandelowski, 2000). Such audit trails provided not only a solid methodological reference
for the reader, but also provided an opportunity for reflective reasoning (on the themes or
categories chosen, interpretations, etc.) for the researchers as the study progressed (Guba, 1981;
Johnson & Waterfield, 2004; Whittemore et al., 2001). For example, if methodology changed at
some point in the study, an audit trail would keep a record of when, why, and what changes were
implemented. Such audit trails became especially useful for managing large data sets and placing
data, methodology, and interpretation within the particular context in which they belonged.
Internal audit
Internal audits of coding and themes for the study were completed at three different intervals
(after 10, 25, and 45 interviews were coded) during the analysis of the study. The purposes of
these audits were to assess inter-rater reliability and to determine similarities and differences in
key themes identified by coders and auditors. Audits were conducted by three auditors, each
members of the research team. The audit included interviews of nurses (RNs, RPNs, and LPNs),
nurse managers, interdisciplinary team members, and patients. The sample of interviews was
based on a stratified selection by profession, education, unit, and health region to ensure
maximum variability of codes and themes. Transcripts for review were then randomly selected
from these stratified data sets. Internal audit results are outlined in Table 1.
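The stratified-then-random selection described above can be sketched in code. This is an illustrative Python sketch only; the field names and records are hypothetical, not the study's actual data.

```python
import random
from collections import defaultdict

# Hypothetical transcript records tagged with strata like those used in the
# study (profession, education, unit, and health region were the variables).
transcripts = [
    {"id": 2055, "profession": "RN",  "region": "A"},
    {"id": 4414, "profession": "LPN", "region": "B"},
    {"id": 6041, "profession": "RPN", "region": "C"},
    {"id": 6046, "profession": "RN",  "region": "A"},
]

def stratified_sample(records, strata_keys, per_stratum, seed=0):
    """Group records by the strata, then randomly select from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in strata_keys)].append(rec)
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# One transcript per (profession, region) stratum for the audit.
audit_set = stratified_sample(transcripts, ["profession", "region"], per_stratum=1)
```

Stratifying first guarantees that every profession and region is represented before chance enters; the random draw within each stratum then avoids cherry-picking transcripts.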
Table 1: Internal audit results

Audit 1 (following 10 interviews coded by each coder) (a)

Coder  Auditor  Interview  Inter-rater reliability
002    001      2055       Not applicable
002    002      4414       76% accuracy
002    003      6041       Not applicable
003    002      6046       Not applicable
003    003      2029       74% accuracy
003    001      4106       Not applicable
004    003      4302       Not applicable
004    002      2028       Not applicable
004    001      6062       41% accuracy

Common themes: Themes from auditors were compared to a summary of initial findings. Similar themes were found by both (e.g., lack of time, fragmentation of care, lack of role clarity and role definition, role overlap), with the exception of two additional themes, one in the summary (language) and one from auditors reviewing interviews (job stressors).

Audit 2 (following 25 interviews coded by each coder) (b)

Coder          Auditor        Interview
003            003            2000
003            003            2083
003            003            2011
004            001            6073
004            001            6060
004            002            4201
002            002            6035
002            002            6040
002, 003, 004  001, 002, 003  4203

Inter-rater reliability: Greater inter-rater reliability was found between coders than auditors; auditor 001 was consistently out of range, as noted in Audit 1. The auditor discrepancy was likely related to different styles of coding, language, and interpretation. The positive coder reliability was likely due to the amount of coding completed, interaction amongst coders, and consistent attendance at coder meetings.

Common themes: Themes were compared to themes identified in a second summary report to the Advisory Committee. Similarities and consistency in themes (e.g., time, role overlap, importance of communication, role clarity, workload) were noted between the audited interviews and the summary report.

Audit 3 (following 45 interviews coded by each coder)

Coder  Auditor  Interview
002    001      2076
003    001      2036
004    001      6069
002    002      6017
003    002      2037
004    002      4308
002    003      2022
003    003      6076
004    003      4213

Inter-rater reliability: As inter-rater reliability was completed in prior audits, it was not completed at this time.

Common themes: Themes were compared to the final research report. Themes were very similar (e.g., role overlap, role clarity, time, continuity of care, communication, workload), although themes were presented more broadly in the final report.

(a) Inter-rater reliability conducted on three randomly selected interviews from audit; compared coding of coders to coders and auditors to coders.
(b) Inter-rater reliability conducted on one interview; compared coding among auditors and one coder.
Overall, the internal audit showed positive results in inter-rater reliability of coding and common
themes identified from data analysis. Although discrepancies were found between coding
completed by coders and auditors, of note was the consistency in coding amongst coders. For the
research team this reliability emphasized the importance of the regular meetings with coders to
discuss node definitions and clarify where data elements best fit in the coding structure. The lack
of consistency in coding between auditors and coders was not unexpected. Coding of qualitative
data will be largely interpretive in nature; therefore, researchers’ insight and language will be
highly individual (Morse & Richards, 2002). The important finding in the internal audit was the
consistency in the themes identified from the data, which reinforced for the research team that the
right course was being pursued and the team should continue data analysis in the manner in which
it was being conducted. The internal audit also facilitated the opportunity for researchers to
engage with the data.
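The "% accuracy" figures reported in Table 1 are percent agreement: the share of commonly coded segments to which two raters assigned the same code. A minimal Python sketch, with hypothetical codes and segment ids (note that simple percent agreement, unlike Cohen's kappa, does not correct for chance agreement):

```python
def percent_agreement(coding_a, coding_b):
    """Share of commonly coded segments given the same code by both raters.
    Each argument maps a segment id to the code assigned by one rater."""
    shared = set(coding_a) & set(coding_b)
    if not shared:
        return 0.0
    matches = sum(1 for seg in shared if coding_a[seg] == coding_b[seg])
    return 100.0 * matches / len(shared)

# Hypothetical codings of four transcript segments by a coder and an auditor.
coder   = {1: "lack of time", 2: "role overlap", 3: "role clarity", 4: "workload"}
auditor = {1: "lack of time", 2: "role overlap", 3: "job stressors", 4: "workload"}

agreement = percent_agreement(coder, auditor)  # 3 of 4 segments match: 75.0
```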
External audit
An expert in qualitative data analysis completed an external audit of the data. The reviewer was
not associated with the study in any way. Audit questions were developed from the work of Flick
(2002) and Miles and Huberman (1994) and reflected an assessment of the procedures undertaken
in the process of conducting the study. Questions are outlined in Table 2.
Table 2: External audit questions
1. Were the findings grounded in the data?
2. Were the inferences logical?
3. Were the category structures appropriate?
4. Were the decisions and methodological shifts justified?
5. Did researcher bias exist?
6. What strategies were used to increase credibility?
Overall, the external audit review was very favourable. The reviewer noted that the sample was
selected in a manner such that the units selected and the perspectives of various categories of
nurses were obtained. The summary reports provided to the external reviewer by the project
manager served as a complementary component to the data managed in N6™; the connections
between identified categories and the data were easily accessible in a systematic manner. Team
meetings demonstrated that the reports were discussed at length, resulting in decisions based on
the data and documentation of changes to the framework.
Furthermore, the review confirmed that inferences made in the data were logical. More
specifically, there was sufficient data for the thematic categorical structures of assessment,
accountability, responsibility, coordination of care, general tasks, patient safety, patient
education, role overlap and ambiguity, autonomy, working to full scope, facilitators, barriers, and
recommendations for unit-based change. Conclusions drawn for these codes were very robust.
The reviewer did note that data were less developed for the codes of critical thinking, problem
solving, isolation, discontent, conflict, respect, and burnout.
The research team was commended for linking all inquiry decisions to the purpose and the
strategies of the study. Specific activities, such as attention to the multidisciplinary nature of the
research team; bi-weekly meetings with coders; creation of a detailed audit trail; documentation
of the coding framework; and execution of an internal audit, were highlighted by the external
auditor as important in increasing the trustworthiness of the study. In terms of research bias,
while the researchers were commended for excellent use of follow-up questions to collect
additional descriptive information, it was suggested that deliberate recruitment of participants
who might hold contrary views to the researchers would have strengthened this component of the
review. Overall, the research team was commended for the data collection, analysis, and
interpretation of this very large qualitative study.
Research report
The final research report was written in such a way as to increase the confirmability of research
results. The report highlighted the shortcomings of the study and provided transparent links
between study results and the actual experiences of the participants in the study (Guba, 1981). To
this end, limitations of the study were outlined and quotations from participants were included to
represent themes identified in the study.
Strengths and Limitations
Several strengths of this study were noteworthy. First, given the large number of interviews
completed, a robust description of the scope of practice of nurses in acute care and the barriers
and facilitators impacting their ability to practice to full scope was clearly evident. During the
management and analysis of the data, we, as researchers, were reflexive and engaged in many
strategies that assisted us in questioning how our knowledge, position, and experience potentially
influenced or shaped analysis and interpretation of research results (Pyett, 2003). When the
findings of this study were discussed both formally and informally with nurses and other
professionals from jurisdictions across Canada, we found that the results seemed to resonate with
those colleagues. We, therefore, are reasonably confident that the findings from this research
represented a current state that potentially characterizes many health care settings.
Several limitations in the methodology to manage the qualitative data for this study were
identified. One key limitation was the inability to simultaneously analyze the data in an iterative
manner to inform the interview process. This was difficult because data were collected by three
research assistants across three geographically diverse sites. Working across sites was particularly
challenging. Timelines were also difficult to manage given the magnitude of the study.
Conclusion
Although there were a variety of challenges in managing the large volume of data generated by
the large number of interviews, the external audit report confirmed for the research team the
strengths of the strategies implemented to manage the data and ensure the quality of the data
analysis. We believe that the collective attention to data collection and analysis—via the training
of interviewers and coders, careful development of the coding framework, expertise of the
qualitative researchers in the analysis of the data, and attention to the development of an audit
trail— has contributed to a rich description of the scope of practice of nursing providers and the
barriers and facilitators to enactment of their scope of practice. Both the internal and external
audit also demonstrated the researchers’ commitment to remaining true to the findings. As
emphasized in the external audit, researchers utilized rigorous methodology both to manage the
data and to ensure that the data analysis captured the unique experience of participants (Ayres,
Kavanaugh, & Knafl, 2003). These data management methodologies have been employed as a
template for other large research studies in which data were collected across sites with multiple
interviewers, participants, and coders.
The research team makes eight recommendations to help ensure rigour in the management of
large scale qualitative studies. First, the importance of the organization of the study cannot be
overstated. One person must take on the role of managing the study. The organization of
staff, scheduling, data collection, data analysis, and the data itself is essential to the success of the
project. Second, diligent documentation of data collection and analysis details (e.g., changes in
approach and rationale) is required. This responsibility is best assumed by one person on the
research team. Third, ensuring a strict timeline for data collection, coding, and analysis is
essential. Fourth, make every effort to use an iterative process for data collection and analysis.
Fifth, conduct, at a minimum, a comprehensive internal audit at key points throughout the study.
We would encourage researchers to undertake an external audit to further increase the credibility
of the study. An external audit also provides an excellent learning opportunity for research team
members. Sixth, regular communication between team members is critical to ensure quality
completion of the study. Regular email contact, phone conversations, and face-to-face and
teleconference meetings are recommended. Seventh, adequate resources are required to ensure
timeliness and quality of results. Resources include both financial and human resources. Finally,
maintain a good sense of humour and build in time to reflect and have fun. The commitment to
large qualitative research is enormous and requires a team effort, with diversion from time to
time.
There is a lack of scientific literature regarding the structures and processes for managing large
qualitative data sets. This article provides concrete examples and recommendations for managing
these large scale qualitative studies to ensure rigour of study results. The external audit completed
by an expert qualitative researcher validates the processes and confirms the successful
management of this large data set and research study. This information will be invaluable as
researchers continue to answer complex health services research questions that inevitably result in
large qualitative data sets.
References
Advisory Committee on Health Human Resources. (2002). Our health, our future: Creating
quality workplaces for Canadian nurses. Final report of the Canadian Nursing Advisory
Committee. Ottawa, ON: Health Canada.
Aita, V. A., & McIlvain, H. E. (1999). An armchair adventure in case study research. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research (2nd ed., pp. 253-268). Thousand Oaks, CA: Sage.
Ayres, L., Kavanaugh, K., & Knafl, K. A. (2003). Within-case and across-case approaches to
qualitative data analysis. Qualitative Health Research, 13, 871-883.
Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). Thousand Oaks, CA: Sage.
Flick, U. (2002). Introduction to qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Fyke, K. (2001). Caring for Medicare: Sustaining a quality system. Saskatchewan Commission
on Medicare. Regina, SK: Government of Saskatchewan.
Gaskell, G., & Bauer, M. (2000). Towards public accountability: Beyond sampling, reliability
and validity. In M. Bauer & G. Gaskell (Eds.), Qualitative researching with text, image
and sound (pp. 336-350). London, UK: Sage.
Guba, E. G. (1981). Criteria for assessing the trustworthiness of naturalistic inquiries.
Educational Communication and Technology Journal, 29, 75-91.
Johnson, R., & Waterfield, J. (2004). Making words count: The value of qualitative research.
Philosophy Research International, 9, 121-131.
Knafl, K. A., & Ayres, L. (1996). Managing large qualitative data sets in family research. Journal
of Family Nursing, 2, 350-364.
Kvale, S. (1996). An introduction to qualitative research interviewing. Thousand Oaks, CA: Sage.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Newbury Park, CA: Sage.
Miles, M., & Huberman, A. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand Oaks, CA: Sage.
Miller, W. L., & Crabtree, B. F. (1992). Primary care research: A multimethod typology and
qualitative road map. In B. F. Crabtree & W. L. Miller (Eds.), Doing qualitative research
(pp. 3-28). Newbury Park, CA: Sage.
Morse, J. M., & Richards, L. (2002). Readme first for a user’s guide to qualitative methods.
Thousand Oaks, CA: Sage.
Oelke, N. D., White, D., Besner, J., Doran, D., McGillis-Hall, L., & Giovannetti, P. (2008).
Nursing workforce utilization: An examination of facilitators and barriers on scope of
practice. Nursing Leadership, 10(1), 58-71.
Polit, D. F., & Hungler, B. P. (1999). Nursing research: Principles and methods (6th ed.). Philadelphia, PA: Lippincott Williams & Wilkins.
Pyett, P. (2003). Validation of qualitative research in the “real world.” Qualitative Health
Research, 13, 1170–1179.
Richards, L. (2005). Handling qualitative data: A practical guide. London, UK: Sage.
Sandelowski, M. (1997). “To be of use”: Enhancing the utility of qualitative research. Nursing
Outlook, 45(3), 125-132.
Sandelowski, M. (2000). Whatever happened to qualitative description? Research in Nursing &
Health, 23, 334-340.
White, D., Oelke, N. D., Besner, J., Doran, D., McGillis-Hall, L., & Giovannetti, P. (2008).
Nursing scope of practice: Descriptions and challenges. Nursing Leadership, 10, 44-57.
Whittemore, R., Chase, S., & Mandle, C. (2001). Validity in qualitative research. Qualitative
Health Research, 11, 522-537.
Copyright of International Journal of Qualitative Methods is the property of International Institute for
Qualitative Methodology and its content may not be copied or emailed to multiple sites or posted to a listserv
without the copyright holder’s express written permission. However, users may print, download, or email
articles for individual use.
As next-generation technology ratchets the price of
sequencing lower and lower, users from aca-
demic labs to Big Pharma are finding themselves
drowning in data. What used to be gigabytes
worth of information has become terabytes or
petabytes. At the same time, the cost crunch
brought on by the global recession has made
researchers leery of unnecessary capital spending.
The result is more and more users moving their data management
to the cloud or outsourcing it entirely.
Whereas large pharma companies may have the funds and
infrastructure to maintain dedicated servers for storage and analysis
of sequencing data, small companies—especially those that don’t
sequence continuously—are leading the migration to the cloud, and
service providers are springing up to meet the demand. Cost, secu-
rity, and convenience top the list of concerns for researchers looking
for a place to unload reams of data. However, once that transition is
made, features like collaborative data sharing, access to third-party
analysis apps, and patient privacy become more important.
Expression Analysis, a Quintiles Co., Durham, N.C., provides
genomic services to the pharma and biotech industry, as well as
academic, government, and foundation laboratories doing research
in molecular biology and genetics. It provides cloud computing
services through a partnership with Golden Helix Inc., Bozeman,
Mont. Its clients require computation-intensive services for gener-
ating the initial RNA or DNA sequence and also for cleaning up,
aligning, and analyzing the sequence.
According to Expression Analysis, a typical sequencing project
for 100 RNA samples would generate 300 to 400 GB worth of com-
pressed data, or 700 GB to 1 TB worth of data in total; and that’s
just for one experiment. For multiple experiments, the amount of
data can add up to astronomical quantities quite quickly.
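To make those quantities concrete, here is a back-of-envelope sketch using the figures quoted above (roughly 1 TB of total data per 100-sample project). The project count and the $0.01/GB/month storage rate are used purely for illustration:

```python
# Back-of-envelope storage arithmetic for repeated 100-sample RNA-Seq projects.
GB_PER_PROJECT = 1000      # ~700 GB to 1 TB of total data per project (article's figure)
COST_PER_GB_MONTH = 0.01   # cloud storage rate in USD per GB per month, for illustration

def storage_cost(num_projects):
    """Cumulative gigabytes stored and the monthly bill to keep them."""
    total_gb = num_projects * GB_PER_PROJECT
    return total_gb, total_gb * COST_PER_GB_MONTH

gb, monthly = storage_cost(50)
# 50 projects -> 50,000 GB (~50 TB), on the order of $500/month just to retain the data
```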
Some applications, such as analyzing cancer samples, are even
more data intensive, because of the depth of coverage and the need
to sample multiple cells in the tumor.
“The cloud offers a full environment in order to do analysis on
a large number of samples simultaneously,” said Wendell Jones,
PhD, vice president of statistics and bioinformatics for Expression
Analysis.
That computing power becomes a commodity for the customer,
replacing expensive, on-site, server infrastructure. The data is
instead accessed through a browser, and there is no need to upload
or download huge files. “You can leave them on the cloud and
access in a streaming fashion via the cloud,” Jones said.
For small companies, the cloud-based service offers additional
advantages beyond saving on hardware and real estate. Startup
companies may not have the structure in place to operate Linux-
based genome software applications. A cloud-based storage and
analysis service allows those companies to use their own local
Windows or Macintosh desktop operating systems.
There are some advantages to maintaining a physical server.
“You have the option of having lower redundancy … and faster
data access times. You can choose to take your old data and
unplug it. You don’t have to pay for power,” explained Jonathan
Bingham, product manager for informatics and software for
Menlo Park, Calif.-based Pacific Biosciences, a provider of genom-
ics services through its SMRT platform technology and hosted
cloud-based storage and analysis service.
September/October 2012
www.dddmag.com
COVER STORY: Managing Data in the Cloud Age
By Catherine Shaffer, Contributing Editor
Exploding sequencing data volumes push researchers to the cloud and into partnerships.
On the other hand, that means taking
responsibility for managing the hardware,
Bingham added, such as replacing failed
drives. That burden of ownership and main-
tenance is not right for every company.
Jones explained that cloud comput-
ing is ideal for research groups that have
“bursty” computing needs, meaning that
generating and analyzing sequence data is
an intermittent need.
“The cloud in some sense is cheap, in
the sense that it’s cheaper to rent a vacation
home than buy it and only use it two or three
weeks a year. If you’re constantly at your
vacation home, it’s just better to buy it.”
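Jones's rent-versus-buy analogy is, at bottom, a break-even calculation. A sketch with entirely hypothetical costs (none of these numbers come from the article):

```python
def breakeven_months(server_capex, onprem_monthly_cost, cloud_monthly_rent):
    """Months of continuous use after which owned hardware becomes cheaper
    than renting equivalent cloud capacity; None if renting is never dearer."""
    extra_rent = cloud_monthly_rent - onprem_monthly_cost
    if extra_rent <= 0:
        return None
    return server_capex / extra_rent

# Hypothetical: a $60,000 cluster plus $1,000/month power and admin,
# versus $4,000/month to rent comparable cloud capacity.
months = breakeven_months(60_000, 1_000, 4_000)  # buying pays off after 20 months
```

For the "bursty" workloads Jones describes, months of actual use accrue slowly, so the break-even point may never arrive, which is exactly his vacation-home point.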
Cost is a major concern at Illumina (San
Diego, Calif.) as well. A giant in the sequenc-
ing industry, Illumina controls 70% of the
market share for sequencing. Illumina can
sequence an entire human genome in a day,
and it offers its cloud-computing solution,
BaseSpace, through Amazon Web Services
(Seattle), the world’s largest cloud hosting
service. Recently, Amazon announced a ser-
vice providing reliable data storage starting at
$0.01 per gigabyte per month.
Although that is a very economical rate
for data storage by any standard, for long-
term storage of hundreds or thousands of
complete genomes, many experts agree it is
better to store the data in the original tissue.
In other words, if the raw data is needed
again in the future, it is cheaper to regener-
ate the sequence from an archived sample.
Illumina offers an even better deal to
its customers. “We’ve picked the ultimate
pricing strategy which is free,” said Alex
Dickinson, senior vice president of cloud
genomics for the company. “Customers get
a free terabyte of data storage, enough for
10 years of typical usage of MiSeq. We do
the secondary processing, alignment, and
variant calling. We also do that for free,”
Dickinson said.
MiSeq is Illumina’s “personal
sequencer,” a next-generation sequenc-
ing system suitable for applications such
as multiplexed PCR amplicon sequenc-
ing, targeted resequencing, small RNA
sequencing, and so forth.
Illumina’s choice to offer free service
is based on concerns of researchers, who
may be comparing the company’s offerings
to use of infrastructure in their facil-
ity. Although in an absolute sense, that
infrastructure is never “free,” because of
the cost of housing it in the facility, its use
often doesn’t come out of an individual
laboratory budget. “If you try to charge
for basic service, they try to compare that
to free,” Dickinson said.
Instead of charging customers directly,
Illumina instead channels revenue through
third-party service providers, who will be
offering genomic analysis apps within the
sequencing environment. The application programming interface (API) for BaseSpace will be
open to partner companies to offer applications that will be available in an app store. An
initial block of 14 companies has already signed up to offer those apps.
Although Amazon cloud services
provide an ideal solution for research,
the rapidly emerging market for clinical
sequencing comes with tougher regulatory
requirements, chief among them compli-
ance with the Health Insurance Portability
and Accountability Act of 1996 (HIPAA).
Amazon cloud services are not cur-
rently HIPAA compliant, and according to
Richard Resnick, CEO of GenomeQuest Inc.
(Westborough, Mass.), it is very unlikely to
become compliant any time soon. Resnick
said that the cloud comprises three
components: application, platform, and
hardware. Achieving HIPAA compliance
requires control of all three of those com-
ponents. A service that is designed around
coordination of many third-party providers
such as Amazon would have a hard time ever
validating full compliance for the entirety of
its applications, platform, and hardware.
“What we’re doing is thinking about
how to connect different parts of the
health care ecosystem through next-gener-
ation sequencing and cloud-based genom-
ics,” said Resnick.
GenomeQuest offers a secure HIPAA-
compliant cloud designed for large scale
analysis of whole genomes and gene panel
samples from clinical laboratories.
Resnick said that unlike research labs,
clinical laboratories can’t tolerate problems
like noise and false positives in their data.
“You can’t do that because there’s a real
patient at the end of the day.”
So in addition to security and data pri-
vacy standards, cloud services for clinical
sequencing applications have a higher bar
to achieve for quality.
“There are still many uncertainties
around the regulatory requirements for
using cloud and hosted IT services in
genomic medicine trials, so it was impor-
tant for us to work with a company that
really understands the healthcare IT
space,” said Spyro Mousses, PhD, director
of the Center for BioIntelligence at The
Translational Genomics Research Institute
(TGen) in Phoenix, Ariz.
[Figure: BaseSpace enables users to perform interactive genetic analysis from any location using a web browser.]
[Figure: Sequence data analysis results depicted in the DNANexus genome browser.]
In November 2011, TGen partnered
with Dell to support the world’s first
personalized medicine trial for pediatric
cancer, and to leverage cloud comput-
ing resources donated by Dell. The Dell
Giving commitment includes multi-year
grant funding to support the clinical trial,
as well as major hardware, software, and
services contributions.
Focusing initially on neuroblastoma,
the trials will leverage high-performance
computing to dramatically accelerate the path from processing sequencing information from
patient tumors to predicting the optimal treatment for each patient. As would be
required of any trial under U.S. Food and
Drug Administration (FDA) regulations, the
cloud solution will be compatible with both
FDA and HIPAA compliance requirements.
The KIDS Cloud, as TGen terms it,
“will provide a hybrid-cloud platform for
securely storing and exchanging genomic
data and clinical information across mul-
tiple collaborating organizations,” accord-
ing to Mousses.
TGen is also participating in several
other large personalized medicine trials
and hopes that the kind of cloud-enabled
computational infrastructure can serve as
a national model for collaborative person-
alized medicine. “It takes a village to cure
a kid with cancer,” Mousses said.
With the advent of next-generation
sequencing technology, the emphasis has
shifted from bringing the cost of sequenc-
ing down to addressing the cost of analy-
sis. “The bottleneck now is being able to
effectively analyze the data,” said Marc
Olsen, president and COO of DNANexus
(Mountain View, Calif.), a provider of
cloud-based data management and analy-
sis. Those challenges include not only the
cost of storage and management of quanti-
ties of data that could fill thousands and
thousands of PCs, but questions of how
to transfer data, and how to share and
collaborate while still maintaining security
and privacy. The industry is currently seek-
ing answers to those emerging problems,
and in some cases already moving towards
some degree of standardization.
Catherine Shaffer is a freelance science writer
specializing in biotechnology and related
disciplines with a background in laboratory
research in the pharmaceutical industry.
[Figure: Expression Analysis’ cloud computing pipeline, powered by Golden Helix, can access virtually infinite storage capacity on the fly and distribute large jobs across hundreds of servers in parallel. A dashboard showing computing progress for a 15-sample RNA-Seq project is shown above.]
Copyright of Drug Discovery & Development is the property of Advantage Business Media and its content may
not be copied or emailed to multiple sites or posted to a listserv without the copyright holder’s express written
permission. However, users may print, download, or email articles for individual use.
Education
Advanced Technologies and
Data Management Practices in
Environmental Science: Lessons
from Academia
REBECCA R. HERNANDEZ, MATTHEW S. MAYERNIK, MICHELLE L. MURPHY-MARISCAL, AND MICHAEL F. ALLEN
Environmental scientists are increasing their capitalization on advancements in technology, computation, and data management. However, the
extent of that capitalization is unknown. We analyzed the survey responses of 434 graduate students to evaluate the understanding and use of
such advances in the environmental sciences. Two-thirds of the students had not taken courses related to information science and the analysis of
complex data. Seventy-four percent of the students reported no skill in programming languages or computational applications. Of the students
who had completed research projects, 26% had created metadata for research data sets, and 29% had archived their data so that it was available
online. One-third of these students used an environmental sensor. The results differed according to the students’ research status, degree type, and
university type. Changes may be necessary in the curricula of university programs that seek to prepare environmental scientists for this techno-
logically advanced and data-intensive age.
Keywords: data life cycle, data repository, education, environmental sensors, eScience
With the advent of recent technological and computational advances, scientists are using increasing numbers of
in situ environmental sensors, model simulations, crowd-
sourcing tasks, and embedded networked systems that
enable environmental studies to incorporate various spatio-
temporal scales and to produce unprecedented amounts
of data (Porter et al. 2005, Benson et al. 2010). Such tech-
nologies and an increasing interest in synthesis studies of
environmental phenomena have made data valuable beyond
their immediate use (Peters et al. 2008). The flood of data
that digital technologies produce (Hey and Trefethen 2003)
underscores the urgency of a rapid adoption of pertinent
skills and best practices by environmental scientists in the
proper management of data sets. Studies in which such
preparedness in the environmental sciences is evaluated
are absent; however, academic institutions may play a role
in imparting the relevant knowledge and skills to the next
generation of scientists.
As electronic devices become smaller and cheaper and
as complementary computer power grows and applications
increase in efficiency, scientists at all career stages are finding
technology useful for addressing topics from global epidem-
ics to climate change. Such integration has transformed
both the experimental techniques and the solitary working
platforms known by predecessors in the field in the not-so-
distant past (Nature 2003). But the use of technology and
interdisciplinary collaborations often necessitates analytical
tools for the integration and analysis of large and hetero-
geneous data sets. In a survey of a distributed seminar course
for ecology graduate students incorporating 11 American
universities, Andelman and colleagues (2004) found that
over 90% of the students did not have skills in the scripted
programming languages that they considered essential for
large data set integration and analysis. The degree to which
academic institutions have modified their curricula or
programs in anticipation of increasing demand for scien-
tists with technological and computational competency is
unknown.
Another trend yet to be quantified is an increase in the num-
ber of environmental scientists who follow proper data man-
agement practices to improve their research. Exemplifying this
trend, the National Science Foundation (NSF) now requires
that all grant applications include data management plans
(NSF 2010). Regardless of the size of a project or its associated
data products, creating and following through with such plans
requires fulfilling metadata requirements and completing the
data life cycle (e.g., collection, management, interpretation,
long-term archiving; Wallis et al. 2010). Metadata are the
documentation and annotations used to manage, share, and
preserve data resources. Many believe that metadata standards
are critical for overcoming widespread problems of linguistic
uncertainty that can render environmental data unshareable
(Regan et al. 2002). The degree to which programs and advis-
ers in the environmental and ecological sciences are instructing
graduate students to correctly capture and record metadata or
to use metadata standards, such as the Ecological Metadata
Language (EML), is unknown.

BioScience 62: 1067-1076. ISSN 0006-3568, electronic ISSN 1525-3244. © 2012 by American Institute of Biological Sciences. All rights reserved. Request
permission to photocopy or reproduce article content at the University of California Press's Rights and Permissions Web site at www.ucpressjournals.com/
reprintinfo.asp. doi:10.1525/bio.2012.62.12.8
www.biosciencemag.org December 2012 / Vol. 62 No. 12 • BioScience 1067
In addition, it is unknown whether programs and advisers
are supporting and conveying the responsibility of proper
data archiving in online data repositories (e.g., Dryad; www.
datadryad.org) and thereby completing the data life cycle.
When graduate students are not trained in data archival
methods or do not take independent action to archive their
graduate research data sets, they may be less likely to archive
data sets in future research endeavors. As an example, the
Networked Digital Library of Theses and Dissertations
already contains over one million graduate products whose
original data may be available only by contacting the
author, or even worse, the data may have been misplaced.
The continuance of this practice would be a huge loss of
opportunity to the academic community, however large or
small each individual student’s data set may be, especially if
the number of graduate degrees awarded continues to grow
(see supplemental figure S1, available online at http://dx.doi.
org/10.1525/bio.2012.62.12.8).
In this study, our first goal was to evaluate the technologi-
cal and computational experience of environmental scien-
tists and their data management practices in the formative
stages of their career. Specifically, we were interested in the
breadth of coursework completed by environmental graduate
students that was germane to computational and information
science and to the analysis of large and complex data sets. We
also sought to determine the proficiency levels of graduate
students with analytical tools, including programming lan-
guages and computational applications that are frequently
employed in environmental studies. Finally, we evaluated the
students’ data management practices, environmental sensor
use, and interdisciplinary collaborations, comparing between
those who had completed and those who had not completed
their master’s research project or dissertation. A secondary
goal was to compare master’s students with doctoral students
and also to determine whether differences exist among differ-
ent institution types in California. Specifically, we surveyed
private California universities, the University of California
(UC), and California State University (CSU). Private univer-
sities differ in their major funding sources, whereas the
latter two differ in their function (i.e., institutions with
exclusive jurisdiction in PhD and professional instruction or
undergraduate-focused institutions with primarily master’s
degree graduate programs, respectively; Douglass 2007).
Using survey responses of current and former graduate
students, we highlight the degree to which academia is
facilitating the integration of technology, computation, and
data management in the environmental sciences and dis-
cuss its implications for the contribution of research data
products to the greater body of scientific knowledge. Finally,
we draw on these results to elucidate methods by which
environmental scientists at all career stages may excel in this
technological and data-intensive era.
Graduate students’ responses and the data-
collection process
During the months of June, July, and August 2011, we
conducted an online survey (using www.surveymonkey.com;
see supplemental form 1). We solicited responses from
master’s and doctoral students in academic departments
related to environmental or ecological sciences from 27
California universities, including 4 private schools, 9 public
universities in the UC system, and 14 public universities in
the CSU system. CSU institutions offer research-based mas-
ter’s degrees and, in general, do not support doctoral pro-
grams. All private universities and UC institutions surveyed
support both master’s and doctoral programs; however, all
of the survey respondents for these university types were
planning to complete or had completed a doctoral degree.
We excluded universities that did not respond to requests
for participation and from whose students we received
fewer than three responses. Private universities were those
classified as research institutions by the Association of
Independent California Colleges and Universities (n = 7),
that offer an environmental- or ecology-related graduate
program (n = 4), and that were receptive to participation
(n > 3). In total, 23 universities, including 18 academic pro-
grams from 11 California State Universities, 16 academic
programs from 9 Universities of California, and 4 academic
programs from 2 private universities, were represented.
The survey responses were solicited through e-mail.
When it was possible, we sent e-mail solicitations to gradu-
ate student electronic mailing lists within each surveyed
department. If such mailing lists were not available, we
collected student e-mail addresses from online department
directory pages and e-mailed the students directly. For a
few surveyed universities, we also e-mailed faculty members
within the relevant departments and asked them to forward
our solicitation e-mail to students. If our first solicitation to
a particular department did not result in responses, we sent
a second solicitation e-mail. Students who had completed
their graduate degree more than two years prior or answered
no to the question “[Do] your education and research foci
fall within the ecological or environmental sciences?” were
excluded from our analyses.
The response rates were difficult to calculate, because the
survey was, in most cases, sent to departmental mailing lists,
the sizes of which were unknown. Instead, we counted the
number of students listed on departmental Web pages. Using
this proxy measure, we calculated approximate response rates
of 23% for the UC sample and 25% for the private sample.
We did not calculate a response rate for the CSU sample.
because department lists were not provided. We processed
and statistically analyzed all of the survey data using scripts
in R (www.r-project.org). For all of the survey questions,
means were derived using the number of responses for each
university as a weight, and the associated 95% confidence
intervals (CI) were reported. We determined the differences
in responses among the three university types and between
the master’s and doctoral students by using chi-squared anal-
yses based on counts derived at the response level. We used
Student's t-test scores to determine significant differences
between the responses of those students with thesis or dis-
sertation research in progress and those who had completed
their research on the basis of weighted means at the individ-
ual university level. It was possible that the students would
respond that their research project was both completed and
in progress; this scenario occurred, for example, when a
student had progressed from a research-based master’s to a
doctoral program.
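The weighting scheme described above can be sketched as follows. The authors worked in R; this Python sketch, with invented per-university numbers, merely illustrates a response-count-weighted mean and an approximate 95% confidence interval (the paper does not spell out its exact CI formula, so a normal approximation is assumed here):

```python
import math

# Hypothetical per-university summaries: (percentage of "yes" responses,
# number of responses). These numbers are invented for illustration only.
universities = [(72.0, 40), (65.0, 25), (80.0, 10)]

# Weighted mean, using each university's response count as its weight,
# as in the survey's analysis.
total_n = sum(n for _, n in universities)
weighted_mean = sum(p * n for p, n in universities) / total_n

# Weighted variance and an approximate 95% CI half-width
# (normal approximation; an assumption, not the paper's stated method).
variance = sum(n * (p - weighted_mean) ** 2 for p, n in universities) / total_n
half_width = 1.96 * math.sqrt(variance / total_n)

print(f"weighted mean = {weighted_mean:.1f}%, 95% CI = +/-{half_width:.1f}")
```

Differences between groups (e.g., master's versus doctoral students) were then tested with chi-squared analyses on response-level counts rather than on these weighted means.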
Survey results
In total, 498 graduate students responded to the survey,
and of those, 434 met the study’s criteria. The number of
eligible responses varied according to the student’s thesis or
dissertation status (in progress, n = 326; completed, n = 131),
according to their education level (master's student, n = 124;
doctoral student, n = 385), and according to university type
(California State University, n = 124; University of California,
n = 261; private university, n = 49) (supplemental table S1).
Coursework. Over 80% (82.3%, 95% CI = 5.3; table 1) of
the students in our survey stated that they had completed
none of the eight computer and information science courses
evaluated in this study. Over 20% of the students had com-
pleted coursework in introductory computing (23.8%, 95%
CI = 5.9) and computer programming (22.9%, 95% CI = 4.6).
The students completed the least amount of coursework in
networking, metadata, and information technology. The stu-
dents showed little intention of eventually taking additional
courses in this discipline (1.0%, 95% CI = 1.6), but that
intention was numerically greatest for bioinformatics and
computational biology (2.4%, 95% CI = 3.8).
A large number of the students—74.6% (95% CI = 6.0)—
stated that they had not completed any coursework related
to the management and analysis of complex data (table 2).
Approximately one-third (30.5%, 95% CI = 6.4) of the
students stated that they had taken at least one course
in geographic information systems (GIS), 29.2% (95%
CI = 6.3) had taken coursework in modeling, and 19.6%
(95% CI = 6.1) had taken courses in spatial analysis. Less
than 20% of the students had taken a course in remote sens-
ing (16.1%, 95% CI = 5.8), time series analysis (12.1%, 95%
CI = 3.2), meta-analysis (6.9%, 95% CI = 3.4), or data min-
ing (4.9%, 95% CI = 3.0).
Skills. A majority—74.0% (95% CI = 6.6)—of the students
stated that they had no skills in the programming languages
and computational applications evaluated in this survey.
Only 17.2% (95% CI = 4.7) of the students, on average,
stated that they had basic skill levels in these areas. The stu-
dents had the least experience with EML (99.1% stated that
they had no experience, 95% CI = 4.7; figure 1), Java (90.5%,
95% CI = 12.1), or IDL (Interactive Data Language; 90.5%,
95% CI = 0.7). The students claimed a basic skill level or
higher in GIS (e.g., ArcGIS; 55.5%) and statistical applica-
tions, including R (55.9%), and JMP, SPSS, or SAS (53.0%).
Advanced technologies. Approximately one-third (36.7%, 95%
CI = 8.7) of the students whose program was still in prog-
ress planned to use environmental sensors in their research
study (figure 2). This number paralleled the percentage of
Table 1. The mean percentage of surveyed graduate students who had taken or
intended to take courses in subjects related to computational and information
science. Values are mean (95% CI).

Course                                     0 courses    1 course     2 courses   3 or more   Intended future
                                           completed    completed    completed   completed   course(a)
Introductory computing                     69.4 (6.7)   23.8 (5.9)   4.2 (2.9)   1.8 (1.1)   0.7 (0.5)
Computer programming                       63.8 (8.0)   22.9 (4.6)   4.6 (2.5)   6.8 (3.2)   1.8 (3.1)
Data structures or algorithms              81.7 (5.7)   14.2 (4.4)   1.8 (1.2)   1.1 (1.6)   1.1 (1.9)
Networking                                 95.1 (2.6)    3.3 (2.3)   0.7 (0.6)   0.5 (0.6)   0.5 (0.4)
Information technology                     90.8 (4.9)    7.4 (4.3)   0.7 (1.0)   0.7 (0.6)   0.5 (1.8)
Database management                        86.1 (4.0)   11.0 (3.5)   1.6 (1.1)   0.5 (0.6)   0.9 (0.7)
Metadata                                   94.2 (4.1)    4.4 (4.0)   0.7 (1.8)   0.2 (0.5)   0.5 (0.4)
Bioinformatics or computational biology    76.9 (6.5)   15.5 (4.7)   3.6 (1.7)   1.6 (1.9)   2.4 (3.8)
All courses                                82.3 (5.3)   12.8 (4.2)   2.2 (1.6)   1.7 (1.3)   1.0 (1.6)

Abbreviation: CI, confidence interval.
a. The survey stated, "0, but I will take one soon."
Table 2. The mean percentage of surveyed graduate students who had taken or
intended to take courses in subjects related to the management and analysis
of large or complex data. Values are mean (95% CI).

Course                            0 courses    1 course     2 courses   3 or more   Intended future
                                  completed    completed    completed   completed   course(a)
Spatial analysis                  71.7 (7.1)   19.6 (6.1)   3.6 (2.1)   2.8 (1.3)   2.3 (1.1)
Geographic information systems    54.3 (9.4)   30.5 (6.4)   7.8 (3.5)   3.5 (1.8)   3.9 (3.7)
Remote sensing                    77.2 (6.9)   16.1 (5.8)   3.7 (2.0)   2.3 (1.5)   0.7 (0.5)
Modeling                          54.7 (7.9)   29.2 (6.3)   7.8 (2.8)   5.4 (2.1)   2.8 (2.1)
Time series analysis              82.1 (4.1)   12.1 (3.2)   3.6 (2.2)   0.7 (0.5)   1.5 (0.9)
Meta-analysis                     91.0 (3.5)    6.9 (3.4)   0.7 (1.8)   0.0 (0.0)   1.4 (0.8)
Data mining                       91.4 (3.3)    4.9 (3.0)   1.1 (1.9)   0.5 (0.6)   2.1 (1.9)
All courses                       74.6 (6.0)   17.1 (4.9)   4.1 (2.3)   2.2 (1.1)   2.1 (1.6)

Abbreviation: CI, confidence interval.
a. The survey stated, "0, but I will take one soon."
students who had completed their research and had, in fact,
used environmental sensors (33.1%, 95% CI = 10.1). More
than 10% (i.e., 14.9%, 95% CI = 9.8) of the students whose
research was in progress did not know what an environmen-
tal sensor was or what it meant to use it in environmental
research, but this number was halved (7.5%, 95% CI = 0.7)
for the students who had finished their research. There was
no significant difference between the percentage of students
whose research was in progress and who intended to use
a sensor in that research and that of the students who had
completed their research and who actually did use a sensor
(table 3). The doctoral students whose research was still in
progress planned to use environmental sensors significantly
more than did the master’s students, and there was a nearly
significant difference in education level for the students who
had used environmental sensors in their research (p = .0520;
table 4a, 4b). The students at the UC institutions planned
on using environmental sensors in their research (41.9%)
significantly more than did those in private (27.1%) and
CSU-system (28.5%) universities (supplemental table S2).
Interdisciplinary collaboration. The percentage of students who
had collaborated with someone whose expertise was outside
the environmental or ecological sciences was significantly
lower (37.6%, 95% CI = 13.9) than the percentage of stu-
dents whose work was in progress who stated that they had
planned such collaborations (55.4%, 95% CI = 7.5; table 3).
The percentage of students who planned an interdisciplinary
collaboration was significantly larger than that of students
who were finished with their research and actually had done
so (table 3). There was no significant difference in interdis-
ciplinary collaboration activities between the master’s and
doctoral students (table 4a, 4b). There were significant differ-
ences in interdisciplinary collaboration among the students
at different university types who had finished their research
(table S2). Specifically, the CSU students were less likely to
collaborate (28.1%) than were the students at UC institu-
tions (39.8%), who were also less likely to collaborate than
the students at private universities (51.7%).
Data management. Approximately 72.3% (95% CI = 6.2) of
the students who were still in the process of completing
their master’s or doctoral research were planning on com-
pleting the data life cycle in their research, and 65.3% (95%
CI = 6.7) of these students intended to archive their research
data so that it would be available online (table 3). Of those
who had already completed their graduate degree, 63.9%
(95% CI = 16.2) stated that they had completed the data life
cycle, whereas only 29.3% (95% CI = 13.1) had made it avail-
able online—significantly less than the prospective figure
from the students still in the midst of their research (table 3).
A large portion of the students stated that they did not plan
on making their data available online, and this number was
greater for the students who had already completed their
thesis or dissertation (45.9%, 95% CI = 1.3) than for those
whose research was still in progress (28.0%, 95% CI = 6.7).
Almost one-third of the students whose research was in
progress did not know what it means to create metadata for
their data sets (28.0%, 95% CI = 8.8), and a similar num-
ber (34.7%, 95% CI = 9.3) did not plan to create metadata
for their data sets. For the students who had finished their
research, 25.6% (95% CI = 13.2) created metadata, 63.2%
(95% CI = 1.7) did not, but 12.0% (95% CI = 1.3) planned
to do so some time in the future.
The students’ data management practices varied accord-
ing to degree type (table 4a, 4b). The doctoral students were
more likely to complete or to plan to complete the data life
cycle. However, the master’s students showed significantly
greater intent to create metadata and to archive their data
products such that it would be available online than did
the doctoral students. There were no significant differences
among the different university types regarding data life cycles
and metadata creation (table S2). But students at private
universities (69.5%) and UC institutions (67.0%) were more
likely to make their research data available online than stu-
dents at a CSU institution.

Figure 1. The level of proficiency (none, basic, proficient, or expert) of the
surveyed graduate students with programming languages or computational
applications: C/C#/C++, EML, ENVI, GIS (e.g., ArcGIS), IDL, Java,
JMP/SPSS/SAS, MATLAB, Access, Python, and SQL/MySQL. The error bars represent
95% confidence intervals. Abbreviations: EML, Ecological Metadata Language;
GIS, geographical information systems; IDL, Interactive Data Language.

The extent of graduate student preparation
Environmental studies in which new kinds of technology,
computation, data life cycle techniques, and open-source
dissemination are employed hold promise for addressing
many important societal issues, including the measurement
of biodiversity shifts (Kelling et al. 2009) and the assess-
ment of climate change (Graham et al. 2010), but our results
suggest that many of the skills and practices that would
enable scientists to use these new opportunities are only
marginally instructed in formal graduate programs in
California in the environmental sciences.

Environmental curricula: New courses and skill sets. Students can
and do learn new methods and technologies on their own,
but advanced computation, in situ field sensor technologies,
and digital data management best practices will only become
standard tools and skills if they are integrated into formal
curricula. Among the topics that we surveyed, GIS and
modeling courses were the most widely studied by the
students: About one-third of them had taken a GIS or
modeling course. Only two other topics in our survey even
reached 20%. This suggests that most environmental
scientists in training are not taking the initiative to expand
their knowledge in these areas through formal courses.

The development of novel courses requires many resources,
including expertise, time, and funding. In some cases, it may
be worthwhile to integrate new material or skills into existing
courses. However, external organizations may provide
relevant materials that can be incorporated into an
institution's curriculum. The DataONE organization, for
example, develops educational programs related to data
management, such as internships, workshops at professional
meetings, and educational modules on specific data
management topics (see www.dataone.org/education for
more information).

Learning to capitalize on technology. In this study, we show
that environmental sensors are important methodological
instruments for a large proportion of graduate students. A
limitation of our study is that we did not assess the levels of
complexity in the sensor setup (e.g., an individual device
versus a sensor network) or in data streams derived from
such devices. More complex scenarios often require that
users have knowledge in areas in which few of the students
in our survey had taken courses, such as data structures and
algorithms, database management, and networking (table 1).
Researchers will also need to understand how new
technologies can be used, their strengths and limitations,
and techniques for analyzing the numerous and complex
data that they output. For example, one must be able to
adequately design environmental experiments to support
reliable inferences, regardless of whether one is using
computational technologies or traditional field-surveying
techniques. The task of designing experiments, however,
becomes more problematic when students are not well
equipped to integrate new technology and statistical
techniques (Millspaugh and Gitzen 2010). Nonetheless, our
results suggest that students already value the integration of
technology. Pedagogical models that address technology in
environmental science and the other aforementioned
concepts and that can be easily duplicated by instructors are
needed.

Figure 2. Mean percentage of responses for the surveyed graduate students
(a) who had completed or (b) who had not yet completed their master's or
doctoral research, for five research steps: completing the data life cycle
(collection, management, interpretation, archival), creating metadata, using
environmental sensors, archiving research data so that it is available
online, and collaborating with a researcher outside environmental science.
The error bars represent 95% confidence intervals. The respondents were
earning or had earned their master's or doctoral degree in the ecological or
environmental sciences at a California State University, the University of
California, or a private California university.

Table 3. The mean percentage of surveyed graduate students who responded
that they planned to complete (n = 326) or had already completed (n = 131)
the relevant research steps. Values are mean (95% CI).

Research step                             In progress   Completed     t(455)    p
Completion of the data life cycle         72.3 (6.2)    63.9 (16.2)    3.388    .0008
Creation of metadata                      37.0 (9.3)    25.6 (13.2)    4.361    <.0001
Use of environmental sensors              36.7 (8.7)    33.1 (10.1)    1.600    .1104(a)
Online archival of research data          65.3 (6.7)    29.3 (13.1)   16.137    <.0001
Collaboration with researchers outside
  environmental science                   55.4 (7.5)    37.6 (13.9)    7.366    <.0001

Abbreviation: CI, confidence interval.
a. This value is not significant.
Table 4a. The percentage of surveyed graduate students who reported that they
had completed the relevant research steps, as a function of their education
level, for the students who had completed their thesis or dissertation project.

Research step                             Master's   Doctoral   χ²(2)    p
Completion of the data life cycle         48.0       67.6       7.578    .0226
Creation of metadata                      20.0       26.9       2.057    .3575(a)
Use of environmental sensors              12.0       38.0       5.912    .0520(a)
Online archival of research data          20.0       31.5       1.205    .5473(a)
Collaboration with researchers outside
  environmental science                   20.0       41.7       4.133    .1266(a)

a. This value is not significant.
Table 4b. The percentage of surveyed graduate students who reported that they
planned to perform the relevant research steps, as a function of their
education level, for the students who had not yet completed their thesis or
dissertation project.

Research step                             Master's   Doctoral   χ²(2)     p
Completion of the data life cycle         75.9       68.4        6.976    .0306
Creation of metadata                      30.6       26.3       21.459    <.0001
Use of environmental sensors              30.6       36.0        6.762    .0340
Online archival of research data          55.5       33.3       21.952    <.0001
Collaboration with researchers outside
  environmental science                   44.4       43.0        1.3873   .4998(a)

a. This value is not significant.
A place for uncommon collaborations. Collaborations allow
nonexperts to take advantage of new technologies—
particularly when those technologies are still in the devel-
opment stage. Scientists can also consult with computer
science or engineering partners about existing off-the-shelf
tools. Developing interdisciplinary collaborations, however,
can be time consuming and challenging. As Andelman
and colleagues (2004) illustrated, students new to inter-
disciplinary work may not be aware of the challenges
involved in initiating and maintaining such collaborations.
Collaborators must spend time learning each other’s lan-
guage (including jargon), research methods, and expec-
tations, and they must develop schedules, project plans,
and—in the end—research products that satisfy everyone
involved. Interdisciplinary projects can be risky from a pro-
fessional point of view, particularly in the early phases of a
research career (Rhoten and Parker 2004). Not incidentally,
theses and dissertations still need to meet the requirements
of the students’ individual disciplines.
Despite these challenges, our survey shows that inter-
disciplinary collaborations by environmental scientists
are important. The obvious benefit of interdisciplinary
collaborations is access to expertise that enhances projects.
Additional benefits are derived when
scientists provide feedback to the devel-
opers of new technologies or work with
developers to ensure that technology is
durable in various field conditions. For
example, new technologies are more
conducive to field studies if they are
adaptive to shifts in environmental and
behavioral phenomena (Collins et al.
2006, Allen et al. 2007, Rundel et al.
2009)—something that environmental
scientists are equipped to assess.
Teaching environmental scientists data man-
agement. As the frequency of collabo-
rations increases, data management
and sharing needs and expectations
grow as well. In addition, complet-
ing the data life cycle—that is, docu-
menting data-collection and analysis
processes, making data available in a
usable format, and submitting data to
a formal data archive—is increasingly
expected by research funders.
The majority of students in our
survey indicated that they had (or
expected to have) completed the data
life cycle by the completion of their
program. Comparing the responses
of students whose thesis or disserta-
tion was in progress with those of
students who had completed their
project shows some differences. For
example, there were fewer positive responses to all of the
questions from those who had already completed their
thesis or dissertation than from those whose thesis or dis-
sertation was still in progress (figure 2). This might indi-
cate wishful thinking by those who were still working on
their projects or poor follow-through on the behalf of the
students who had finished their projects. Perhaps this is
indicative of the students’ perceptions that they would be
well equipped to tackle the task of completing the data life
cycle at the completion of their project, but the students
found that their formal education in these domains had
been inadequate to motivate the completion of the data
life cycle. The differences in the responses among the stu-
dents from the different kinds of institutions (e.g., the CSU
students were less likely to have completed the data life
cycle and to have archived data) indicate that the doctoral
students from the UC-system and private universities may
have had more time or resources to complete data manage-
ment tasks than did the CSU students.
Previous studies have shown that the difficulties of creat-
ing metadata are some of the biggest impediments to sharing
data (Campbell et al. 2002, Tenopir et al. 2011). Researchers
without metadata expertise must either spend significant
amounts of time helping outside researchers to understand
their data or must forgo data sharing altogether. Paradoxically,
as collaboration becomes the norm, metadata creation can
become an even bigger challenge. Regardless of the size of
a research project, the loss of metadata can be prevented by
documenting data-collection methods, data transformations,
and analysis steps as they occur and according to metadata
standards (Michener et al. 1997). However, in collaborative
projects, the responsibility for metadata creation is often
not discussed, and knowledge of the different components
of a project is diffused among the collaborators (Mayernik
2011, Wallis and Borgman 2011). Therefore, putting together
integrated documentation from large projects requires broad
coordination and input and specifically assigning the task of
data documentation to particular individuals.
If students are not familiar with metadata practices that
allow their data to be used by collaborators or reused by out-
side researchers, the potential benefits of data sharing across
projects may not be realized. Few metadata-specific courses
for graduate students exist, and most that do exist are offered
in library or information science departments. Therefore,
it is not surprising that very few of our survey respondents
had taken courses relating to metadata (table 1). In fact, of
the activities surveyed in table 1, creating metadata received
the lowest cumulative positive response and had the highest
response for "I don't know what that means right now" from
students who were in the midst of their graduate research. In
addition, figure 1 shows that 99% of our survey respondents
had no knowledge of EML, which was formally adopted
by the US Long Term Ecological Research Network as a
metadata standard in 2003. Our survey results suggest that
metadata training and EML are not regular parts of graduate
student education in ecology and environmental science.
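To give a flavor of what such a standard looks like, the fragment below is a minimal, illustrative EML-style record. It is abridged and not schema-valid; the dataset title, names, and content are invented for illustration, and real EML documents carry many more required fields:

```xml
<!-- Illustrative, abridged EML-style fragment (not schema-valid);
     all titles, names, and values here are hypothetical. -->
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example.1.1" system="example">
  <dataset>
    <title>Hypothetical stream temperature survey, 2011</title>
    <creator>
      <individualName>
        <givenName>Jane</givenName>
        <surName>Doe</surName>
      </individualName>
    </creator>
    <abstract>
      <para>Hourly in situ sensor readings from three stream sites.</para>
    </abstract>
  </dataset>
</eml:eml>
```

Even a skeletal record like this captures who collected the data and what they describe, which is precisely the documentation that makes archived data sets reusable by outside researchers.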
Conclusions
Scientists are now uniquely poised to employ in situ sensors,
to create and use open-source data sets, and to capitalize on
other technologies for use in research. However, graduate
students and early-career scientists may need to acquire some
additional skills and knowledge that were not necessary or
available a generation ago. In some cases, immediate advis-
ers and mentors provide this background. However, as our
survey results indicate, many of the requisite newer skills and
knowledge are not being obtained through coursework or
instruction. Institutions responsible for equipping environ-
mental scientists with the tools necessary for success have
a challenge ahead of them. Frameworks and educational
models addressing the concepts discussed in this article that
can be replicated and tested for efficacy are much needed.
The move toward competence may also be accomplished
unconventionally through consultation with people with
such skills, such as a technician or expert on a critical instru-
ment or someone with a new statistical tool (see box 1 for
more information). Students may need to go beyond the
instruments and analytical approaches used in their immedi-
ate labs.
Regardless of available training, graduate students or
early-career scientists may doubt their ability to integrate
such technology into their own research or to create data sets
that can be repurposed by other researchers. Integrating tech-
nology and creating reusable data collections may require
additional effort up front, but long-term rewards will result
for the individual scientist and for the scientific and public
communities. For example, many data archives are now
requesting that data users include data citations in their pub-
lications (e.g., Cook 2008). The goal of such data citations
is to make the data as valued in scientific settings as peer-
reviewed publications, although this type of work is not yet
often considered in the evaluation processes of most tenure
and promotion committees.
How do the environmental sciences move forward to help
students take advantage of advanced technology and data
repositories? Students and educators can contribute in differ-
ent ways. Graduate students can be proactive in investigating
new opportunities, including spending time exploring in situ
sensor devices or networked systems currently available to
determine whether they are appropriate for the study’s objec-
tives. They must also consider that other scientists may have
similar needs for new equipment. Software that might elimi-
nate bottlenecks in data workflow or automate data entry or
processing should be investigated. Database structures, data
workflows, and metadata requirements should be established
at the beginning of a project—before data collection. Metadata
routines should be established during data collection, man-
agement, and interpretation. Options for long-term data
archiving should be investigated. Research centers and univer-
sity libraries may have data and metadata archiving options
that may help reach the target audience. Relevant seminars,
symposiums, and conferences should be sought in other
disciplines, such as computer science and engineering. These
gatherings provide opportunities to share innovative ideas and
to meet prospective tech-savvy colleagues (who might be keen
on using their skills for environmental research).
From the educator’s point of view, how should an environ-
mental science curriculum evolve to provide graduate stu-
dents with the skills necessary to use new computational and
data tools? Environmental science programs within different
institutions will be situated to approach this question in dif-
ferent ways, but some promising approaches include the fol-
lowing: Classes or academic programs can be developed that
are cross-listed among disciplines or cotaught by instructors
from various disciplines. Cross-listed courses introduce
students to new research methods and techniques, as well
as to students from other disciplines. Ever-improving online
instruction tools are also making interinstitution courses pos-
sible, as is exemplified by a new distributed program in land-
scape genetics developed by Wagner and colleagues (2012).
The list of courses that meet students’ methods requirements
could include computer and information science courses. Just
as many students take statistics courses, they might also
benefit from these. Student-focused workshops can be created
with data or computational themes. Workshops can provide
intensive environments in which to learn particular methods
or technologies and might be easier to organize and implement
than new or cross-listed courses (e.g., Andelman et al. 2004).
Students can be encouraged to look into internship programs,
such as DataONE’s Summer Internships Program
(www.dataone.org/internships). As the NSF and other funding
agencies continue to promote data-intensive research
collaborations, the possibilities for relevant internships will
probably increase.

Box 1. Selection of Web sites offering information and resources in advanced
technologies and data management practices in environmental science.

The following lists are not exhaustive but will serve as a springboard for those
interested in learning more about the eScience research community and its services.

Research centers
The Center for Embedded Networked Sensing (http://research.cens.ucla.edu) and its
Urban Sensing program (http://urban.cens.ucla.edu)
DataONE (The Data Observation Network for Earth; https://dataone.org)
NEON (The National Ecological Observatory Network; http://www.neoninc.org)
The National Center for Ecological Analysis and Synthesis (www.nceas.ucsb.edu)
The South African Environmental Observation Network (www.saeon.ac.za)
The Knowledge Network for Biocomplexity (http://knb.ecoinformatics.org)
Ecoinformatics’ online resource for managing ecological data and information
(www.ecoinformatics.org)
Oregon State University’s Eco-Informatics Summer Institute
(http://eco-informatics.engr.oregonstate.edu)
The University of Washington’s eScience Institute (http://escience.washington.edu)

Selected projects
What’s Invasive! Community Data Collection (http://whatsinvasive.com)
Trash Track (http://senseable.mit.edu/trashtrack)
Project BudBurst (http://neoninc.org/budburst)
Urban Sensing Projects (http://urban.cens.ucla.edu/projects)
Biketastic (http://biketastic.com)

Data management and metadata resources
DMPTool (http://dmp.cdlib.org)
Ecological Metadata Language (http://knb.ecoinformatics.org/software/eml)
The Dublin Core Metadata Initiative (www.dublincore.org)
The Kepler Project (https://kepler-project.org)

Online data repositories
DataONE (www.dataone.org)
The Dryad data repository (www.datadryad.org)
Global Population Dynamics Database
(www.imperial.ac.uk/cpb/gpdd/secure/login.aspx?ReturnUrl=%2fcpb%2fgpdd%2fgpdd-b.aspx)
The Interaction Web DataBase (www.nceas.ucsb.edu/interactionweb)
The US Long Term Ecological Research Network Data Portal (http://metacat.lternet.edu)
Metacat (http://knb.ecoinformatics.org/knb/docs)
The Paleobiology Database (www.paleodb.org)
DataUp (http://datapub.cdlib.org)
Vegbank (http://vegbank.org)

1074 BioScience • December 2012 / Vol. 62 No. 12 • www.biosciencemag.org
Curriculum and culture changes related to advanced tech-
nologies and data management in environmental science
are certain to be gradual. As more examples appear of
the ways in which the increased use of new technologies
and increased attention to data management can benefit
individual environmental scientists, stu-
dents’ interest in these tools and tech-
niques is likely to increase. To serve
as an example of our own cause, we
have archived the data we collected in
this study in the Dryad data repository
(see Hernandez et al. 2012). The Dryad
data repository accepts “data files asso-
ciated with any published article in the
biosciences, as well as software scripts
and other files important to the article”
(http://datadryad.org/depositing). We
prepared our data for deposit while
the present article was being reviewed.
In preparing our data for submission,
we followed the recommendations pro-
vided on Dryad’s “Depositing data to
Dryad” Web page (http://datadryad.org/
depositing). Dryad’s page provides rec-
ommendations on file names and data
file documentation and suggestions for
standardization. Because our data are
survey data, we also followed the data-
deposit recommendations provided by
the Inter-University Consortium for
Political and Social Research
(www.icpsr.umich.edu/icpsrweb/deposit/index.jsp), the largest
archive of quantitative social science data. Data sets from
environmental studies might be best
prepared for deposit according to the
recommendations provided by the Oak
Ridge National Laboratory Distributed
Active Archive Center (DAAC) for
Biogeochemical Dynamics (Hook et al.
2010).
As more scientists become accus-
tomed to documenting and archiving
their data in long-term data archives,
more data will be available for reuse.
Unless metadata becomes a more salient
topic within environmental science edu-
cation, however, these archived data sets may be of highly
variable utility. Some data archives, such as the Oak Ridge DAAC,
provide support for metadata creation. Data archives that use
a model similar to self-publication, such as the Dryad reposi-
tory, require the data creators to create and deposit metadata.
The process of documenting data sets for use by someone
else is different from the process of documenting data sets for
one’s own use, even though descriptions of many of the same
project aspects are involved, such as annotations of methods,
sampling processes, and errors or uncertainties. Outside users
require considerably more in-depth and detailed metadata
descriptions.
Metadata standards are crucially important in integrating
data from individual projects. In our survey, we investigated
only one standard: EML. Not all data-sharing systems for
ecological and environmental data use EML; many other
metadata standards exist, both general-purpose standards
and topic-specific standards, such as Darwin Core for biodi-
versity data (Wieczorek et al. 2012) and Federal Geographic
Data Committee standards for geospatial data. However,
the lack of metadata training indicated by our survey sug-
gests that students will come to any metadata standard
unequipped with basic knowledge of how to create metadata
following such standards.
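Because nearly all of the survey respondents had no knowledge of EML, a concrete glimpse of the standard may be useful. The fragment below is a minimal, hypothetical sketch of an EML 2.1 record; the element names follow the published schema, but the package identifier, title, creator, and method text are invented for illustration only.

```xml
<!-- Hypothetical example; element names follow the EML 2.1 schema,
     all values are invented for illustration. -->
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example.1.1" system="knb">
  <dataset>
    <title>Hypothetical soil-moisture sensor data set</title>
    <creator>
      <individualName>
        <givenName>Jane</givenName>
        <surName>Researcher</surName>
      </individualName>
    </creator>
    <abstract>
      <para>Brief description written for outside users of the data.</para>
    </abstract>
    <methods>
      <methodStep>
        <description>
          <para>Sampling design, instruments, and known uncertainties
          are annotated here.</para>
        </description>
      </methodStep>
    </methods>
  </dataset>
</eml:eml>
```

Even a skeleton record such as this captures the project aspects that outside users require (methods, sampling processes, and uncertainties) in a structured form that repositories and other researchers can parse.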
With the increased attention to computational and data-
intensive science by federal funding agencies, universities,
and the general public, new curriculum initiatives in the
environmental sciences might be well received by both insti-
tutions and students. Building new tools and techniques
into educational curricula is essential to enabling individual
scientists and large collaborative groups of scientists to solve
Earth’s environmental problems in this data-intensive age.
Acknowledgments
This research was funded by National Science Foundation
grant no. EF-0410408 and Center for Embedded Networked
Sensing grant no. CCR-0120778.
References cited
Allen MF, et al. 2007. Soil sensor technology: Life within a pixel. BioScience
57: 859-867.
Andelman SJ, Bowles CM, Willig MR, Waide RB. 2004. Understanding
environmental complexity through a distributed knowledge network.
BioScience 54: 240-246.
Benson BJ, Bond BJ, Hamilton MP, Russell MK, Han R. 2010. Perspectives
on next-generation technology for environmental sensor networks.
Frontiers in Ecology and the Environment 8: 193-200.
Campbell EG, Clarridge BR, Gokhale M, Birenbaum L, Hilgartner S,
Holtzman NA, Blumenthal D. 2002. Data withholding in academic
genetics: Evidence from a national survey. Journal of the American
Medical Association 287: 473-480.
Collins SL, et al. 2006. New opportunities in ecological sensing using wireless
sensor networks. Frontiers in Ecology and the Environment 4: 402-407.
Cook R. 2008. Citations to published data sets. FluxLetter 1: 4-5.
(20 September 2012; http://hwc.berkeley.edu/FluxLetter/FluxLetter-Vol1-No4)
Douglass JA. 2007. The California Idea and American Higher Education:
1850 to the 1960 Master Plan. Stanford University Press.
Graham EA, Riordan EC, Yuen EM, Estrin D, Rundel PW. 2010. Public
Internet-connected cameras used as a cross-continental ground-
based plant phenology monitoring system. Global Change Biology 16:
3014-3023. doi:10.1111/j.1365-2486.2010.02164.x
Hernandez RR, Mayernik MS, Murphy-Mariscal ML, Allen MF. 2012.
Data from: Advanced Technologies and Data Management Practices in
Environmental Science: Lessons from Academia. Dryad Data Repository.
http://dx.doi.org/10.5061/dryad.cv86385c
Hey AJG, Trefethen AE. 2003. The data deluge: An e-Science perspective.
Pages 809-824 in Berman F, Fox GC, Hey AJG, eds. Grid Computing:
Making the Global Infrastructure a Reality. Wiley.
Hook LA, Santhana Vannan SK, Beaty TW, Cook RB, Wilson BE. 2010. Best
Practices for Preparing Environmental Data Sets to Share and Archive.
Oak Ridge National Laboratory Distributed Active Archive Center.
doi:10.3334/ORNLDAAC/BestPractices-2010
Kelling S, Hochachka WM, Fink D, Riedewald M, Caruana R, Ballard G,
Hooker G. 2009. Data-intensive science: A new paradigm for biodiver-
sity studies. BioScience 59: 613-620.
Mayernik MS. 2011. Metadata Realities for Cyberinfrastructure: Data
Authors as Metadata Creators. PhD Dissertation. University of
California, Los Angeles. doi:10.2139/ssrn.2042653
Michener WK, Brunt JW, Helly JJ, Kirchner TB, Stafford SG. 1997.
Nongeospatial metadata for the ecological sciences. Ecological
Applications 7: 330-342.
Millspaugh JJ, Gitzen RA. 2010. Statistical danger zone. Frontiers in Ecology
and the Environment 8: 515.
Nature. 2003. Who’d want to work in a team? Nature 424: 1.
[NSF] National Science Foundation. 2010. Scientists Seeking NSF Funding
Will Soon Be Required to Submit Data Management Plans. NSF.
(20 September 2012; www.nsf.gov/news/news_summ.jsp?cntn_id=116928)
Peters DPC, Groffman PM, Nadelhoffer KJ, Grimm NB, Collins SL,
Michener WK, Huston MA. 2008. Living in an increasingly con-
nected world: A framework for continental-scale environmental science.
Frontiers in Ecology and the Environment 6: 229-237.
Porter J, et al. 2005. Wireless sensor networks for ecology. BioScience 55:
561-572.
Regan HM, Colyvan M, Burgman MA. 2002. A taxonomy and treat-
ment of uncertainty for ecology and conservation biology. Ecological
Applications 12: 618-628.
Rhoten D, Parker A. 2004. Risks and rewards of an interdisciplinary research
path. Science 306: 2046. doi:10.1126/science.1103628
Rundel PW, Graham EA, Allen MF, Fisher JC, Harmon TC. 2009.
Environmental sensor networks in ecological research. New Phytologist
182: 589-607. doi:10.1111/j.1469-8137.2009.02811.x
Tenopir C, Allard S, Douglass K, Aydinoglu AU, Wu L, Read E, Manoff M,
Frame M. 2011. Data sharing by scientists: Practices and perceptions.
PLOS ONE 6 (art. e21101). doi:10.1371/journal.pone.0021101
Wagner HH, Murphy MA, Holderegger R, Waits L. 2012. Developing an
interdisciplinary, distributed graduate course for twenty-first century
scientists. BioScience 62: 182-188. doi:10.1525/bio.2012.62.2.11
Wallis JC, Borgman CL. 2011. Who is responsible for data? An exploratory
study of data authorship, ownership, and responsibility. Proceedings
of the American Society for Information Science and Technology 48:
1-10.
Wallis JC, Mayernik MS, Borgman CL, Pepe A. 2010. Digital libraries for
scientific data discovery and reuse: From vision to practical reality.
Pages 333-340 in Hunter J, Lagoze C, Giles L, Li Y-F, eds. Proceedings of
the 10th Annual Joint Conference on Digital Libraries. Association for
Computing Machinery. doi:10.1145/1816123.1816173
Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R,
Robertson T, Vieglais D. 2012. Darwin Core: An evolving community-
developed biodiversity data standard. PLOS ONE 7 (art. e29715).
doi:10.1371/journal.pone.0029715
Rebecca R. Hernandez (rebecca.hernandez@stanford.edu) is a doctoral student
in environmental Earth system science at Stanford University, in Stanford,
California. She studies plant and soil ecological processes using sensor tech-
nologies and computational tools. Matthew S. Mayernik is a research data
services specialist at the National Center for Atmospheric Research, in Boulder,
Colorado. He received his PhD in information studies from the University of
California, Los Angeles, in 2011. He studies data and metadata management
practices across scientific disciplines. Michelle L. Murphy-Mariscal is a research
scientist and Michael F. Allen is the director at the Center for Conservation
Biology at the University of California, Riverside. MLM-M uses imaging tech-
nologies to study corridor ecology in southern California, and MFA leads the
Terrestrial Ecology Observing Systems program for the Center for Embedded
Networked Systems Consortium.