Penn State University Libraries

Digital Toolkit

Digital Textual Objects: Policies and Guidelines

Definition

Digital textual objects may be monographs, serials, manuscripts, etc. They consist of:

  • Text scanned from printed pages
  • Text originating in digital form ("born digital"). This is textual content created without scanning or OCR. It may have been hand-keyed, may be rendered into pages through a page composition package, or may be continuous, even unformatted, text.
  • Textual components of non-textual items as applicable

Newspapers and electronic course reserves are excluded from this definition and from the standards that follow.

 

Paper Image Quality

All page images with the exception of Metalmark and Penn State Press Material

  • Archival: 600 dpi color/grey/b&w, uncompressed TIFF
  • Working: 300 dpi color/grey/b&w, uncompressed TIFF
  • Display: 72-150 dpi color/grey/b&w, JPG/PNG/GIF (exceptions allowed)
  • Thumbnails: 72 dpi; color and dimensions may be limited according to the preference of the web site.

Metalmark and Penn State Press Material page images

  • Working: 600 dpi color/grey/b&w, uncompressed TIFF
  • Display: 72 dpi, 3-bit GIF
  • Thumbnails: 72 dpi; color and dimensions may be limited according to the preference of the web site.

Penn State Press Born Digital page images

  • TBD - example from Romance Studies

OCR

OCR text will be accepted as is after processing, assuming 80% accuracy and verification by the content manager. (CONTENTdm OCR is used for CONTENTdm material; PrimeOCR or vendor-supplied OCR is used for materials loaded into other software platforms.)
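
The policy does not say how the 80% figure is measured. As one possible reading, the sketch below treats it as character accuracy: one minus the edit distance between the OCR output and a hand-checked sample, divided by the length of the sample. The function names, the sample-based approach, and the choice of metric are all assumptions made for illustration, not part of the policy.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Share of the reference text reproduced correctly, clamped at 0."""
    if not reference:
        raise ValueError("reference sample must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))


# Assumed interpretation of the "80% accuracy" figure in the policy.
ACCURACY_THRESHOLD = 0.80


def meets_threshold(reference: str, ocr_text: str,
                    threshold: float = ACCURACY_THRESHOLD) -> bool:
    """True when the OCR sample is at or above the assumed threshold."""
    return character_accuracy(reference, ocr_text) >= threshold
```

A content manager could run a check like this on a few hand-keyed sample pages before accepting the OCR for a whole work; the sample size is a project decision.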

Exception: post-OCR error correction and formatting will not be performed unless OCR enhancement was planned and budgeted for in the planning phase of the project.

Display: OCR text will not be displayed for digital text. 

Exceptions will need to be made for ADA compliance/requests. A method will need to be determined for general material. Delivery of the text of Penn State Press material will need to be investigated with the Press.
In the future, full text may be provided in alternative formats for text readers (Palmer Reader, MS Reader, etc.). Method and exceptions to be determined.


Metadata

Descriptive metadata - level of description required for textual digital objects

  • collection - highest order of description for a given digital textual object. Determined at the onset of the project. Collection = group of individual works or a single work when that is the extent of the digital object.
    Metadata record created by cataloging, using content standards such as AACR, LCSH
  • work - a single work within a collection, may be the same as the collection when the collection represents a single work.
    metadata inherited from collection level
    record created by cataloging, as above
  • standard when digitizing a resource that has already been cataloged
    exception in other cases
  • intermediate structural levels based on "title" of the volume, part, chapter, section - optional - determined by the content manager and Digital Technology Advisory Group at onset of project.
    Romance Studies materials - markup at chapter level unless parts are present. When this is the case, markup at part and chapter level.
  • page - no page-level descriptive metadata
  • material type assigned at page level for specific content exceptions (e.g., map within a book, plates, etc.). Must be determined at onset of project; otherwise this will be treated as an exception.
  • full text metadata is hidden by default (available for searching but not displayed)
  • rights metadata (ownership, access, use) is assigned at the collection or work level, and is inherited by each page or content part so that there is an explicit statement of rights in each digital file. Standard: rights metadata is the same for all pages within a collection or work.
    Penn State Press books may vary rights metadata within a collection or work due to copyright limitations, etc.
  • Technical metadata - technical characteristics of page images include:
    • filename
    • file format
    • file size
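
The rights-inheritance rule in the list above (rights assigned at the collection or work level, carried explicitly by every page, with page-level variation allowed for Penn State Press books) can be sketched as follows. The class names, field names, and sample values are hypothetical and are not taken from any delivery platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rights:
    """Rights statement: ownership, access, and use (per the policy)."""
    ownership: str
    access: str
    use: str


@dataclass
class Page:
    filename: str
    # Hypothetical field: set only for the Penn State Press exception,
    # where rights may vary within a collection or work.
    rights_override: Optional[Rights] = None


def resolve_rights(work_rights: Rights, page: Page) -> Rights:
    """Return the explicit rights statement for one page.

    Standard case: the page inherits the collection/work rights.
    Exception case: a page-level override takes precedence.
    """
    if page.rights_override is not None:
        return page.rights_override
    return work_rights


# Illustrative values only.
work_rights = Rights(ownership="Example owner", access="public",
                     use="educational use")
pages = [
    Page("examplebook_001.tif"),
    Page("examplebook_002.tif",
         rights_override=Rights("Example owner", "restricted", "no reuse")),
]

# Every page ends up with an explicit statement, inherited or overridden.
explicit = {page.filename: resolve_rights(work_rights, page) for page in pages}
```

Resolving the statement per page, rather than looking it up at display time, matches the stated goal of an explicit rights statement in each digital file.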


Workflow
Cataloging - collection-level metadata is created before page images are processed.

Exception - when the collection does not exist as an entity (e.g., not a book), Cataloging will have access to the collection and its contents prior to cataloging.

Administrative metadata

The standard practice includes external (to software delivery platform) tracking of

  • decisions that have been made about the collection: selection criteria, acquisition information
  • work that has been done to the collection, work and page(s)
  • tracking and storage of this information is the responsibility of DOT.

Digital Reproduction Type

Standard: page images with OCR text available for search but not display

  • Blank pages must be included. If blanks were not scanned and included in the image fileset, they will be represented by one reusable blank page image file.
    Exceptions allowed in special cases and for previously digitized material
    No associated text file will be generated.
  • Unnumbered pages and irregularly numbered pages will be included and numbered according to the numbering scheme allowed by the delivery software.
  • Missing pages: missing pages will be located and replaced before digitization. When a page cannot be replaced, a standard filler page will be used, with an associated text file containing a standard statement.

Searching and Browsing

Cross-collection searching by internal/external search engines
Whenever possible and allowed by the presentation software, the metadata of a collection will contain the collection name, format, and other descriptive keywords to allow cross-collection searching, limits, etc.

If the collection has specific metadata features that are not functional with the cross-collection search, a collection-specific search may be developed when allowed by the presentation software.

Exception - custom programming to develop such a search.

If the presentation software does not allow for cross-collection searching, then it shall be permissible to require the patron to perform additional and/or redundant searching and browsing, pending administrative approval.


URIs and Persistent URIs

A PURL will be assigned to each collection or work and will have the following address: http://resource.libraries.psu.edu/[collection name]/
The collection name must be determined during the planning phase of the project.

The collection name should be short, succinct, and clearly distinguishable from other collections. The name cannot contain spaces or special characters, except for the underscore. Names should be all lower case. PURLs will be set at the start of the project, before any work is performed, and activated by I-Tech. Once activated, PURLs cannot be altered or deleted.
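
The naming rules above can be checked mechanically before a PURL is requested. The sketch below is one possible reading: it assumes that "special characters" means anything other than ASCII lower-case letters, digits, and the underscore, and that digits are allowed; the policy does not state either point. The function names are hypothetical.

```python
import re

# Base address quoted in the policy.
PURL_BASE = "http://resource.libraries.psu.edu/"

# Assumption: lower-case ASCII letters, digits, and underscores only.
_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_collection_name(name: str) -> bool:
    """True when the name has no spaces, upper case, or special characters."""
    return _NAME_PATTERN.fullmatch(name) is not None


def build_purl(name: str) -> str:
    """Return the PURL for a collection name, or raise if the name is invalid."""
    if not is_valid_collection_name(name):
        raise ValueError(f"invalid collection name: {name!r}")
    return f"{PURL_BASE}{name}/"
```

Because an activated PURL cannot be altered or deleted, running a check of this kind during the planning phase is cheaper than discovering a bad name afterward.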

Exception: Additional PURLs may be allotted on rare occasions, but all prior PURLs for the collection will remain in effect.


File Naming

See File Naming Standards

File naming for presentation: filenames must be unique, consecutive, and identifiable as being part of the whole text, following the pattern [cataloging ID/OCLC #/ISBN #]_[incremental file counter].[file extension] (e.g., "mountainstories_001.txt"). The underscore must be used immediately before the incremental file counter and can only be used in that position. Filenames must be all lower case and cannot contain spaces or any other non-alphanumeric characters, apart from that underscore and the period before the file extension.
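
A minimal sketch of the presentation filename rule follows. The three-digit zero padding is inferred from the "mountainstories_001.txt" example and is an assumption, as are the function names and the restriction to ASCII characters.

```python
import re

# identifier, one underscore, numeric counter, period, extension.
_FILENAME_PATTERN = re.compile(r"[a-z0-9]+_[0-9]+\.[a-z0-9]+")


def is_valid_presentation_filename(filename: str) -> bool:
    """True when the filename follows [identifier]_[counter].[extension]."""
    return _FILENAME_PATTERN.fullmatch(filename) is not None


def make_filename(identifier: str, counter: int, extension: str,
                  width: int = 3) -> str:
    """Build a filename with a zero-padded counter, validating the result.

    width=3 is an assumption taken from the example in the policy.
    """
    name = f"{identifier}_{counter:0{width}d}.{extension}"
    if not is_valid_presentation_filename(name):
        raise ValueError(f"filename violates the naming rule: {name!r}")
    return name


# Consecutive filenames for the first three pages of a text.
first_three = [make_filename("mountainstories", n, "txt") for n in range(1, 4)]
```

Zero padding keeps the files in page order when they are sorted alphabetically, which is one way to satisfy the "consecutive" requirement.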

Web Entry Point

Standard: A Web entry point ("Splash Page") that provides a basic description of the collection/work will be created for each object. Search functionality and/or links will be provided to works within a collection if structural metadata and presentation software allow.

Exception - additional programming, enhanced web features.
