Frequently-Asked Questions About Data Sets
What is a Data Set?
A dataset is a compilation of data elements that represent the characteristics of a systematically drawn sample of observations. There are two types of datasets: primary and secondary.
Primary Datasets
A primary dataset is collected and compiled by the researcher for the purpose of addressing a specific research question.
Secondary Datasets
A secondary dataset is collected and compiled by another individual or an agency. Secondary datasets can consist of public-use files, restricted-access files, or a combination of both. Examples of secondary datasets include the Health Care Financing Administration's Medicare Current Beneficiary Survey (which profiles the demographic characteristics, health status and functioning, access to care, sources of and satisfaction with care, insurance coverage, financial resources, and family supports of Medicare beneficiaries), the National Opinion Research Center's General Social Survey (which focuses on topics such as the role of government, sociopolitical participation, social networks, and religious socialization), and the National Center for Health Statistics' National Health Interview Survey (which provides information about the amount and distribution of illness, its effects in terms of disability and chronic impairments, and the kinds of health services people receive).
Can file extensions identify the dataset type?
A filename extension sometimes identifies the dataset type, but not reliably. One extension can apply to several types of datasets, operating systems use different labels for the same data, and some datasets are not labeled at all. Some common extensions are as follows:
.txt - ASCII data or documentation
.dat - usually ASCII, can be a variety of formats, including EBCDIC
.xls - Microsoft Excel Spreadsheet
.dbf - dBASE II, III, or IV format
.por - SPSS portable format
.sav - SPSS PC binary (also SST binary)
.ssp - SAS transport file
.ssd/.sd2 - SAS for PC/Windows
.sda, .dta - Stata
.ebc, _ebcdic - EBCDIC binary format
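The list above can be read as a lookup table from extension to candidate formats. The following is a minimal Python sketch under that reading; the names EXTENSION_HINTS and guess_formats are hypothetical, and the result is only a hint, since one extension can cover several formats.

```python
from pathlib import Path

# Candidate formats per extension, taken from the list above.
# An extension is only a hint: several formats can share one.
EXTENSION_HINTS = {
    ".txt": ["ASCII data or documentation"],
    ".dat": ["ASCII (usually)", "EBCDIC", "other formats"],
    ".xls": ["Microsoft Excel spreadsheet"],
    ".dbf": ["dBASE II, III, or IV"],
    ".por": ["SPSS portable"],
    ".sav": ["SPSS PC binary", "SST binary"],
    ".ssp": ["SAS transport"],
    ".ssd": ["SAS for PC/Windows"],
    ".sd2": ["SAS for PC/Windows"],
    ".sda": ["Stata"],
    ".dta": ["Stata"],
    ".ebc": ["EBCDIC binary"],
}


def guess_formats(filename):
    """Return candidate formats for a filename, or an empty list if unknown."""
    name = filename.lower()
    # The "_ebcdic" label is a name suffix rather than a dotted extension.
    if name.endswith("_ebcdic"):
        return ["EBCDIC binary"]
    return EXTENSION_HINTS.get(Path(name).suffix, [])
```

An empty result means the name gives no clue, in which case the codebook or other documentation is the place to look.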
What is a codebook?
Generically, a codebook is any information on the structure, contents, and layout of a data file. Typically, a codebook includes: column locations and widths for each variable; definitions of different record types; response codes for each variable; codes used to indicate nonresponse and missing data; exact questions and skip patterns used in a survey; and other indications of the content of each variable. Many codebooks also include frequencies of response. Codebooks vary widely in quality and in the amount of information they include.
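To illustrate how column locations, widths, response codes, and missing-data codes are used together, here is a minimal Python sketch that reads one fixed-width record. The variable names, column positions, and codes are invented for the example and do not come from any real codebook.

```python
# Hypothetical codebook entries: (variable name, start column, width).
# Columns are 1-based, as they usually are in printed codebooks.
LAYOUT = [
    ("respondent_id", 1, 4),
    ("age", 5, 2),
    ("sex", 7, 1),
]

# Hypothetical response codes and missing-data codes.
VALUE_LABELS = {"sex": {"1": "male", "2": "female"}}
MISSING_CODES = {"age": {"99"}, "sex": {"9"}}


def parse_record(line):
    """Slice one fixed-width record into labelled values using the layout."""
    record = {}
    for name, start, width in LAYOUT:
        # Convert the 1-based codebook column to a 0-based string slice.
        raw = line[start - 1 : start - 1 + width].strip()
        if raw in MISSING_CODES.get(name, set()):
            record[name] = None  # nonresponse or missing data
        else:
            # Replace a response code with its label when one is defined.
            record[name] = VALUE_LABELS.get(name, {}).get(raw, raw)
    return record


print(parse_record("0001342"))
# {'respondent_id': '0001', 'age': '34', 'sex': 'female'}
print(parse_record("0002999"))
# {'respondent_id': '0002', 'age': None, 'sex': None}
```

Without the layout, the record "0001342" is just a string of digits, which is why reviewing the codebook comes before any analysis.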
How do I access codebooks?
The codebook provides the user with the information necessary to access and analyze a dataset. It is usually necessary to review the codebook to determine whether the dataset will provide the information you need.
In some cases the Penn State Libraries will have a paper copy of the codebook accompanying a dataset. To see if a specific codebook is available in the library, search The CAT using keywords from the name of the study.
In cases where a keyword search yields several similar entries, you must look through the records for the one that indicates the type of file is "data". For example, a keyword search for the Medicare Current Beneficiary Survey yields several similar entries, and the user will have to determine which of these is the appropriate one. In the case of the Medicare Current Beneficiary Survey, the appropriate entry would be the one that reads "Medicare Current Beneficiary Survey". Any codebooks associated with the data will be listed on the record and can be checked out like any other book.
In some cases datasets are owned by the library on CD-ROM or floppy disk. These datasets are cataloged and can be located through The CAT. The CAT record will note if there is a print codebook available. In some cases the codebook is available on the CD-ROM along with the data itself.
How do I get help with dataset software packages?
For assistance getting started, make an appointment with the Social Sciences Librarian who oversees the dataset program.
Can I use an asterisk in the search to match partial words?
No, because our search engine does this automatically. Search terms are stemmed, meaning that partial-word matches will appear in search results.
What's the difference between phrase searching and word searching?
When you select word searches, the search engine looks for the words independently of one another. Phrase searching looks for the words together. For example, searching for "labor force" as a phrase will ensure that you don't get search results relating to pregnancy. Phrase searching is best used when one of your search terms is very common, but its context within the phrase is rather specific.
Please note that in typical Boolean search engines, this would be accomplished by use of quotation marks in your query. Quotation marks will not work with our search engine, however. The search engine will ignore quotation marks.
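The difference between the two modes can be sketched in a few lines of Python. This is an illustration only, not the search engine's actual implementation: the function names are hypothetical, word search is assumed to require every word to be present, and case is ignored for simplicity.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words; quotation marks are dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_match(query, document):
    """Word search: every query word appears somewhere, in any order."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(query))


def phrase_match(query, document):
    """Phrase search: the query words appear next to each other, in order."""
    q, d = tokenize(query), tokenize(document)
    return any(d[i : i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc_a = "Labor force participation among older workers"
doc_b = "Complications of labor: a task force report on pregnancy outcomes"

print(word_match("labor force", doc_a), phrase_match("labor force", doc_a))
# True True
print(word_match("labor force", doc_b), phrase_match("labor force", doc_b))
# True False
```

Both documents contain the two words, so a word search returns both, while a phrase search keeps only the first. Because tokenize discards punctuation, a query typed with quotation marks behaves the same as one typed without them.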
When searching, should I type my terms in uppercase or lowercase letters?
Searching is not case sensitive when your search query is in lowercase. Searching is case sensitive when the query term contains capital letters.
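The case rule above can be written as a small Python function. This is a sketch of the rule as stated, with a hypothetical function name, and not the search engine's own code.

```python
def term_matches(query_term, word):
    """Apply the case rule to a single term.

    An all-lowercase query term matches any capitalization of the word;
    a query term containing a capital letter must match exactly.
    """
    if query_term == query_term.lower():
        return query_term == word.lower()
    return query_term == word


print(term_matches("census", "Census"))  # True: lowercase query ignores case
print(term_matches("Census", "census"))  # False: capitalized query is exact
print(term_matches("Census", "Census"))  # True
```

In practice this means a lowercase query gives the broadest results, and capital letters are worth typing only when the capitalization itself is part of what you are looking for.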
