Penn State University Libraries

Documenting data

Sanborn Fire Insurance Map of State College, Pa., June 1922 / PSU Libraries

Data Management

Documenting your data is an initial step in managing them

Managing data is an integral part of the research process. It can be challenging, particularly when studies involve several researchers or are conducted from multiple locations. How data are managed depends on the types of data, how they are collected and stored, and how they are used over the length of the study.

The outcome of your research depends in part on how well you manage your data. Managing data helps you as a researcher organize research files and data for easier access and analysis. It helps ensure the quality of your research. It supports the published results of your work and, in the long term, helps ensure accountability in data analysis. Good data management starts with comprehensive and consistent data documentation and should be maintained through the life cycle of the data.

  • Designate the responsibilities of every individual involved in the study.
  • Designate how data will be stored and backed up.
  • Designate how data will be dealt with through each modification of the study.
  • And make sure the rules are followed through!

Data documentation encompasses the following:

  • names, labels and descriptions for variables, records and their values
  • explanation of codes and classification schemes used
  • codes of, and reasons for, missing values
  • derived data created after collection, with code, algorithm or command file used to create them
  • weighting and grossing variables created and how they should be used
  • data listing with descriptions for cases, individuals or items studied, for example for logging qualitative interviews
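
As an illustration of the first three items above, a codebook can be kept in machine-readable form alongside the data. The following is only a minimal sketch: the variable names, value codes, and missing-value codes are invented for this example and are not drawn from any particular study or standard.

```python
import json

# Hypothetical codebook for an illustrative survey file; every variable name
# and code below is invented for this sketch.
codebook = {
    "age": {
        "label": "Age of respondent in years",
        "type": "integer",
        "missing_codes": {-9: "refused", -8: "not asked"},
    },
    "employ": {
        "label": "Employment status",
        "type": "categorical",
        "values": {1: "employed", 2: "unemployed", 3: "retired"},
        "missing_codes": {-9: "refused"},
    },
}


def describe(variable, value):
    """Return a human-readable meaning for a raw value of a variable."""
    entry = codebook[variable]
    if value in entry.get("missing_codes", {}):
        return "missing: " + entry["missing_codes"][value]
    if "values" in entry:
        return entry["values"].get(value, "undocumented code")
    return str(value)


print(describe("employ", 2))
print(describe("age", -9))

# The codebook can be saved next to the data file as plain JSON text.
print(json.dumps(codebook, indent=2))
```

Keeping the reasons for missing values next to the value labels means an analyst can tell a refusal from a skipped question without consulting the original collectors.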



Metadata for describing data

Lantern slide showing variety of stamens / PSU Libraries

Metadata defined

What are metadata? Metadata give context to your research data by providing descriptive detail about them. They offer standardized, structured information that explains the data in terms of, for example, purpose, origin, time references, geographic location, creator, access conditions, and terms of use of your data collection. Used to enable resource discovery, metadata can provide pathways for searching existing data, serve as a bibliographic record for citation, or facilitate online browsing of data.

Key metadata elements

Deciding on what elements to use to describe your data is one way to start structuring them. Examples of metadata elements are title, contributor, creator, subject, description, type, format, date, relation, identifier. An example of a metadata schema, or element set, is the Dublin Core metadata schema.
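
To make the idea of an element set concrete, here is a minimal sketch of a record that uses the fifteen element names of the Dublin Core Metadata Element Set. The record values are invented, and the list of "required" elements is an assumption made for this example only, since Dublin Core itself treats every element as optional.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical record; all values are invented for illustration.
record = {
    "title": "Stream flow measurements, example watershed",
    "creator": "Example Researcher",
    "subject": "Streamflow",
    "description": "Daily discharge readings from two gauging sites.",
    "type": "Dataset",
    "format": "text/csv",
    "date": "2012-06-01",
    "identifier": "example-dataset-0001",
}


def check_record(record, required=("title", "creator", "date", "identifier")):
    """Return (unknown element names, missing required elements).

    The `required` tuple is a local project rule assumed for this sketch,
    not part of Dublin Core.
    """
    unknown = sorted(set(record) - DUBLIN_CORE_ELEMENTS)
    missing = [name for name in required if not record.get(name)]
    return unknown, missing


print(check_record(record))
```

A check like this catches misspelled element names and empty fields before records are shared or deposited.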

What you enter for "title," "subject," "format," or any other metadata field says something about the data you have collected. An important best practice in creating metadata is to use a controlled vocabulary, that is, a standardized terminology for your community of interest (e.g., art history catalogers often use the Getty Research Institute's Art & Architecture Thesaurus). Complying with an accepted standard, such as a controlled vocabulary or an authority list (e.g., Library of Congress Authorities), will help in the retrieval and indexing of your data.

Metadata Consultation Services at Penn State University Libraries? Contact:

Kevin Clair - Metadata Librarian, Cataloging and Metadata Services  814-865-2257

Examples of metadata

An example of a standardized approach to describing botany data can be seen in the Swingle Plant Anatomy Collection, particularly in its data dictionary (a guide to the types of data in a database), which is based on Dublin Core.

Another example is a sample GenBank record:

A sample metadata record from GenBank, the NIH genetic sequence database.


Measuring stream flows, W. Carson River, 1901/Bureau of Reclamation

Managing your data for the long term

Close attention to storage, back-up, security, and sustainability of your data means you lessen the risks of compromising their quality and accessibility over the long term. This is also a part of data management that likely will entail collaborating with IT staff in your department or campus unit. For guidance on this, see "Questions to Ask As You Prepare a Grant Proposal," from Stanford University.

Storage and back-up

One storage issue to consider is how rapidly the data are expected to grow over the lifetime of the research project. Part of answering this question involves determining whether data will be collected in automated ways, which can greatly increase the scale of data collection, or whether project staff will gather the data themselves (e.g., by entering them in a database or a lab notebook). Options for short-term storage include hard disk drives and portable media (e.g., DVDs and CDs).
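
A rough estimate of growth can be worked out with simple arithmetic. The sketch below assumes a constant collection rate and a fixed number of stored copies. The function name and the figures passed to it are hypothetical and should be replaced with values from your own project.

```python
def projected_storage_gb(files_per_day, avg_file_mb, days, copies=2):
    """Estimate total storage in GB (1 GB = 1024 MB) at a constant rate.

    `copies` counts every stored copy, for example one working copy
    plus one backup.
    """
    total_mb = files_per_day * avg_file_mb * days * copies
    return total_mb / 1024


# Hypothetical automated sensor: one 50 MB file per hour for one year,
# kept as a working copy plus one backup.
estimate = projected_storage_gb(files_per_day=24, avg_file_mb=50, days=365)
print(round(estimate, 1), "GB")
```

Even a crude figure like this helps when discussing storage options with IT staff, because it shows whether the project is likely to need gigabytes or terabytes.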

Server storage possibilities at Penn State

Penn State offers a range of server storage options - some free of charge, others fee-based. These services are generally for short-term storage. They are not recommended for long-term archiving or curation of digital data and content. Contact the Research Data Management Services Team for guidance on archiving and other curation issues.

All Penn State students, faculty, and staff can apply for personal Web space on the Personal server. Students, faculty and staff who request Personal Web space will get a www folder/directory added to their respective PASS folders. Space allocations for this service no longer apply due to the 10GB maximum quota available to PASS users.

TSM is a fee-based service. It acts as a file backup and archive server for the disk drives of any workstation or personal computer connected to the Internet. TSM runs as a server on the IBM RS/6000 SP under the AIX operating system. In addition, TSM supports 25 different platforms as clients and offers disaster recovery and Hierarchical Storage Management (HSM).

TSM is available to Penn State faculty, staff, and departments.

Many academic departments, units, and schools/colleges at Penn State also have storage options - check with your affiliated department or unit. Penn State colleges, departments, and official units are welcome to purchase ITS Web space. An ITS Charge Account (formerly a P-Account) must be established for charging purposes.

For more information about storage services at Penn State, see Information Technology Services (ITS) Account Services.


Security and Sustainability


Data Security

Ownership is key: It is important for researchers to understand the relevant ownership rules for any data that they collect or use. From an ethical standpoint, researchers should consider the implications of data ownership agreements before they are made with other researchers, institutions, or funding agencies.

Typically, when research is funded by federal or nonprofit granting agencies, the data are owned by the institution receiving the grant. The primary researcher or scholar receiving the grant has the responsibility for storage and maintenance of the data, including the protection of confidential or sensitive information.

Data obtained through research supported by private or corporate funding, however, may have different guidelines for ownership and restrictions on sharing. This issue is further complicated when organizations such as universities patent data sets.

Confidential / Sensitive and Proprietary Data

Scholars and researchers have a moral and professional responsibility to ensure that confidential or sensitive data is stored and released in a way that protects research participants. For example, the “Privacy Rule” of the federal Health Insurance Portability and Accountability Act (HIPAA) advises on maintaining confidentiality for research data that comes from health care records; HIPAA calls for specifications of data handling responsibilities and privileges.

Data that include confidential or sensitive information, if properly cleaned, can still be shared by following certain guidelines:

  • withhold part of the data
  • statistically alter the data in ways that will not compromise secondary analyses
  • require researchers who seek data to commit to protect privacy and confidentiality
  • provide data access in a controlled site, sometimes referred to as a data enclave.
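
The first guideline, withholding part of the data, can be sketched as dropping directly identifying fields before a file is shared. The field names and sample rows below are hypothetical. Removing direct identifiers alone does not guarantee that participants cannot be re-identified, so treat this as an illustration of the idea rather than a complete procedure.

```python
# Hypothetical list of directly identifying fields; adjust to your own data.
DIRECT_IDENTIFIERS = ("name", "email", "phone")


def withhold_fields(rows, fields=DIRECT_IDENTIFIERS):
    """Return copies of the rows with the listed fields left out."""
    blocked = set(fields)
    return [
        {key: value for key, value in row.items() if key not in blocked}
        for row in rows
    ]


# Invented sample data for illustration.
participants = [
    {"id": 1, "name": "Example Person", "email": "person@example.org", "score": 42},
    {"id": 2, "name": "Another Person", "email": "other@example.org", "score": 37},
]

shareable = withhold_fields(participants)
print(shareable)
```

The function returns new rows and leaves the original records untouched, so the full data set stays available to the research team under its own access controls.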

Sustainability and Data Formats

Data must be archived in a controlled, secure environment in a way that safeguards the primary data, observations, or recordings. The archive must be accessible by scholars analyzing the data, and available to collaborators or others who have rights of access. Primary research data should be stored securely for sufficient time following publication, analysis, or termination of the project. The number of years that data should be retained varies from field to field and may depend on the nature of the data and the research.

Sustainable data management is crucial to the value of research and to ensuring continued scholarship. Typically, in data storage, there is an access copy, for use, and an archival copy, essentially for preservation and back-up purposes. The importance of backing up data cannot be overemphasized, because natural disasters and breakdowns in systems and software cannot be predicted. Back up your data early and often!
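
One way to keep an access copy and an archival copy in step is to copy the file and then compare checksums. The sketch below uses only the Python standard library. The function names, file name, and directory layout are assumptions made for this example, and a real archive would normally sit on separate storage from the access copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_archival_copy(access_file, archive_dir):
    """Copy a file into archive_dir and verify that the copy is identical."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / Path(access_file).name
    shutil.copy2(access_file, target)  # copy2 also preserves timestamps
    if sha256_of(access_file) != sha256_of(target):
        raise IOError("archival copy does not match the access copy")
    return target


# Demonstration in a temporary directory with an invented data file.
with tempfile.TemporaryDirectory() as workdir:
    access = Path(workdir) / "observations.csv"
    access.write_text("site,flow_cfs\nA,12.5\n")
    archived = make_archival_copy(access, Path(workdir) / "archive")
    print(archived.name, sha256_of(archived) == sha256_of(access))
```

Recording the checksum at the time of archiving also makes it possible to confirm later that the archival copy has not been altered or corrupted.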

Choosing data formats and software depends mostly on the preference of the researcher but can often be dictated by discipline-specific standards and customs. While ensuring the long-term usability and sustainability of data requires attention to standard and interchangeable software, there are also Preferred Formats (from the UK Data Archive) for data creation and preservation. 

For more information about selecting data formats and software with respect to sustainability, see "Sustainable Data Formats" (University of Wisconsin-Madison).

