ALCTS - Association of Library Collections & Technical Services

Final Report (continued)



Dublin Core Metadata and the Cataloging Rules

By John Attig, Pennsylvania State University Libraries

Background

“The Dublin Core is a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has also attracted the attention of formal resource description communities such as museums and libraries.” [Dublin Core Metadata home page – http://purl.oclc.org/metadata/dublin_core/]

The Dublin Core (DC) is designed for maximum simplicity and flexibility. It is expected that DC metadata will be provided by the creators or distributors of the resources themselves, perhaps by filling in a form in their authoring program. On the other hand, Dublin Core can be qualified and extended to meet the requirements of a variety of users. It is theoretically possible to encode most of a fully standard AACR2 description in DC metadata elements, and it is anticipated that some metadata creators will do exactly that. However, the Dublin Core is directed at a broader and less exacting set of resource producers, and the content of a typical set of DC metadata is likely to be less full and less rigorous in its content.

The principles governing use of the Dublin Core elements are simple and straightforward. All elements are optional; all elements are repeatable; elements may appear in any order; and all elements can be qualified by language (language of the metadata) or scheme (authority or standard for the content).
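As a minimal sketch of these principles, a set of DC metadata can be modeled as an unordered collection of statements, each optionally qualified by language or scheme. The class and field names below (`DCStatement`, `DCRecord`) and the sample values are hypothetical, chosen only for illustration; they are not part of any Dublin Core specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class DCStatement:
    """One occurrence of an element; the names here are illustrative assumptions."""
    element: str                    # e.g. "TITLE", "CREATOR", "SUBJECT"
    value: str
    language: Optional[str] = None  # language of the metadata value
    scheme: Optional[str] = None    # authority or standard for the content


@dataclass
class DCRecord:
    """Every element is optional and repeatable; order carries no meaning."""
    statements: List[DCStatement] = field(default_factory=list)

    def add(self, element, value, language=None, scheme=None):
        self.statements.append(
            DCStatement(element.upper(), value, language, scheme)
        )

    def values(self, element):
        """Return every value recorded for an element (possibly none)."""
        return [s.value for s in self.statements if s.element == element.upper()]


record = DCRecord()
record.add("creator", "Attig, John")
record.add("subject", "Cataloging", scheme="LCSH")  # qualified by scheme
record.add("subject", "metadata")                   # repeated, unqualified keyword
```

Because nothing is required, a consumer of such a record has to be prepared for any element to be absent, repeated, or unqualified.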

The following comments are based in part on a collection of examples prepared for the Task Force by a number of contributors.

Our object here is to evaluate Dublin Core metadata as a source of cataloging data for records based on the Anglo-American cataloguing rules. The Task Force recognizes that metadata in general and the Dublin Core in particular have applications other than AACR2-based cataloging records. Indeed, it is arguable that Dublin Core metadata might be applied most effectively in a system designed specifically to support its use, rather than in library cataloging databases. This is one of the questions that this report will explore.

On the other hand, if library cataloging databases are to contain records for Web resources (as it is certain that they will), Dublin Core metadata contains a wealth of information that can be used in those records. We will evaluate the kinds of information that each Dublin Core element may contain and indicate how that information can be used in preparing an AACR2-based cataloging record.

Finally, we will discuss the rules in Chapter 9 of AACR2 and make recommendations about the need for changes to those rules to support the use of metadata as a source of cataloging data.

Note: Much of the argument here will make use of a distinction between metadata as a source of information and metadata as a source of cataloging data. “Source of information” is a technical term in AACR2, referring to a source from which information is transcribed in various elements of the cataloging record. In order to contrast with this technical terminology, we have used “source of cataloging data” to refer to factual data on which various elements in the cataloging record may be based; it is thus a much broader concept which includes not only exact transcription or quotation, but summarization or reformulation of the factual information by a cataloger.

Dublin Core Metadata’s Support of the Four User Tasks

Dublin Core metadata supports the four user tasks set forth in Functional Requirements for Bibliographic Records to varying degrees, but its lack of established rules and procedures governing the content of data elements makes Dublin Core elements less reliable than cataloging data. The explicit simplicity of the element set and the fact that all elements are optional also undermine the reliability of Dublin Core metadata. The following discussion notes the relevance of Dublin Core elements for each of the user tasks.

  • FIND: Dublin Core metadata is designed primarily to support the discovery or finding of electronic resources. The elements are intended to be the most significant pieces of information by which a user might seek an electronic resource. The elements include the TITLE, the CREATOR (author, etc.), OTHER CONTRIBUTORS, SUBJECT – elements that are likely to be primary search categories. There are other elements that are likely to be secondary or restricting features of a search (LANGUAGE, COVERAGE, FORMAT).

    Although there are only limited requirements about the content of these elements, the content may be optimized in the same manner as cataloging data. For example, the content of the metadata elements may be literally identical with the same information shown in eye-readable form on the resource. And controlled vocabularies and authority control practices may be applied to the content of name (CREATOR, CONTRIBUTOR) and SUBJECT elements. According to the Dublin Core element description, “To promote global interoperability, a number of the element descriptions suggest a controlled vocabulary for the respective element values.” However, this is not a requirement, and the original intent of the Dublin Core – to capture information supplied by the authors or distributors of electronic resources – will probably apply to some extent to most random collections of metadata. Unless the metadata is created as part of a project that is able to impose its own rules for content, only minimal assumptions about the reliability of the data can be made.

  • IDENTIFY: The ability to identify the particular resource retrieved and to distinguish similar resources is not one of the explicit objectives of the Dublin Core. However, there is a set of elements that are related to the “instantiation” of the resource: DATE, TYPE, FORMAT, and IDENTIFIER. Unfortunately, the DATE element only provides a limited ability to distinguish versions. This is one area in which Dublin Core metadata may be inadequate to support the user’s needs.

  • SELECT: Once again, Dublin Core is not intended to provide all the information necessary for a user to make a selection among multiple search results. A certain amount of information is provided. In particular, the resource DESCRIPTION may be of great utility in evaluating the relevance of various resources. The COVERAGE element, if sufficiently precise, may also be useful, as (in fact) may any of the elements, given that user selection can be based on any factor. In general, however, it seems to be the assumption behind the Dublin Core that the entire resource is available for the user’s examination during the selection process. Given networked resources and a reasonably-sized set of search results, it is probably feasible for a user to examine the resources themselves. The larger the result set, however, the less efficient this becomes and the more valuable the metadata surrogates will be. Dublin Core metadata, unlike cataloging data, is not intended as a complete surrogate for the resource, but to the extent that it does represent the resource, it can be used to support selection among resources.

  • OBTAIN: Dublin Core is intended to support discovery and retrieval. In a networked environment, obtaining a resource should be fully supported by the inclusion of an accurate address in the IDENTIFIER element. Most of the effort in this area has gone towards assuring that the identifiers assigned to electronic resources are – and remain over time – accurate.

Both cataloging data and Dublin Core metadata support the four user tasks, although Dublin Core is designed only to support the finding and obtaining of electronic resources. On the other hand, cataloging data is optimized to support all four tasks in ways that cannot be expected of metadata. In particular, the use of controlled vocabularies and the practice of authority control enhance the ability to find, and the principle of transcription and concepts of versioning enhance the ability to identify and select desired resources. Cataloging practices add considerable value to the raw data provided by the resources described in bibliographic records, and this added value is intended to support the ability to find, identify, select and obtain desired resources.

Our cataloging databases are high-quality tools for information retrieval, but they are only as good as the standards that apply. The integrity and consistency of these databases depend on applying more or less the same standards to all records in the database. If a significant portion of the database does not reflect the same level of consistency, the database becomes unreliable. It is therefore damaging to the quality of a cataloging database to include in it records based on Dublin Core metadata unless that metadata was formulated according to cataloging principles and practices. This may be possible for metadata coming from particular projects which have been able to adopt appropriate standards. However, it is not possible for a broad range of Internet resources containing metadata provided by authors or data producers. For such resources, it would be preferable to maintain the metadata-based records in a separate database. The metadata will provide a higher level of accessibility than the Internet itself, but its lack of consistency will not damage the even higher level of quality we have invested so much in providing in our cataloging databases.

Dublin Core Elements as Sources of Cataloging Data

The official definitions of the Dublin Core metadata element set are found in “Description of Dublin Core Elements” – [http://purl.oclc.org/metadata/dublin_core_elements]. A mapping of DC elements to the USMARC fields is contained in “Dublin Core/MARC/GILS Crosswalk,” prepared by the Network Development and MARC Standards Office at the Library of Congress [http://lcweb.loc.gov/marc/dccross.html]. The following discussion is based on these sources and discusses the use of DC information in AACR2 cataloging records.

  1. TITLE

    The TITLE element corresponds to the Title Proper (AACR2 9.1B1, USMARC 245$a). The source of information for the Title Proper is the title screen or other eye-readable information. Only when there is no eye-readable information can a title be transcribed from other internal evidence such as metadata in the file header. Therefore, the metadata TITLE will usually need to be compared with the eye-readable title before it can be accepted as the Title Proper. If it is different from the eye-readable title, the metadata TITLE would be recorded as a Variant Title (USMARC 246, with a caption “$iTitle from metadata:”).

  2. CREATOR

    The CREATOR element corresponds to the main and/or added entries. The USMARC Crosswalk maps this element to field 720, field 700 (if a personal name is specified) or field 710 (if a corporate name is specified).

    For an AACR2-based description, the rules in Chapter 21 would need to be applied, and a main entry determined. If the CREATOR (or one of the CREATORS) is determined to be the main entry, a 1XX field would be used. If no name type is specified, the cataloger would have to determine whether the name was a personal or corporate name.

    The content of this element may or may not conform to the rules for form of name in Chapters 22-24, and the name may or may not be consistent with the official form in the national authority file. In order to conform to AACR2 practice, authority work would need to be done. Since the USMARC field by itself does not indicate whether the content of the field is an authorized heading, it is particularly important that authority control procedures be built into any use of CREATOR information in cataloging records.

    It should also be noted that the CREATOR element corresponds to other AACR2 elements as well, such as the statement of responsibility (AACR2 9.1F, USMARC 245$c), credits note, etc. Although the DC element is not intended as a descriptive (as opposed to an access) element, the data given in the DC CREATOR element may be very useful in describing the responsibility for creation of the resource. Although it is most likely to be formulated as an access point (e.g., an inverted personal name), it may be transcribed in brackets in the Statement of Responsibility area or in a note.

  3. SUBJECT

    The SUBJECT element may contain various identifiers relating to the subject of the resource, such as keywords or classification notations. The default USMARC mapping is to field 653 (Uncontrolled subject access), although specific fields such as 650 for LC Subject Headings or 050 for LC Classification numbers may be used if the metadata include identification of such subject schemes.

    This element does not involve descriptive cataloging covered by AACR2, but it should be noted that this is not a transcribed element. Therefore, it may be used without further modification. Its usefulness will be determined by the specificity of the scheme identification. In a catalog that uses controlled subject headings and classification, uncontrolled keywords will be less useful than controlled headings and classification.

  4. DESCRIPTION

    The DESCRIPTION element corresponds to a Summary note (AACR2 9.7B17, USMARC 520). As with the SUBJECT element, this is not transcribed data and therefore can be used without modification in a catalog record. The usefulness of the result will depend only on the quality of the abstract or summary.

  5. PUBLISHER

    The PUBLISHER element corresponds to the Name of publisher, distributor, etc. (AACR2 9.4D, USMARC 260$b). The prescribed source for this element in AACR2, like that for the title, gives precedence to eye-readable information over information in the HTML source code. Therefore, the content of the PUBLISHER element would need to be verified; if there is no eye-readable publisher information, the metadata can be used.

  6. CONTRIBUTOR

    The OTHER CONTRIBUTORS element corresponds to the added entries. All of the points made under the CREATOR element above apply here, including the need for authority work and the use of this information as the basis for statements of responsibility and credits notes.

  7. DATE

    According to the Dublin Core definition, the DATE element contains “a date associated with the creation or availability of the resource. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985.” This corresponds in content (but not in form) to the Date of publication, distribution, etc. (AACR2 9.4F, USMARC 260$c). As with the PUBLISHER element, the prescribed source of information for this element gives priority to eye-readable information. The date would have to be verified and formatted according to 9.4F. It should also be noted that the DC DATE element is not necessarily a date of publication; “creation or availability” can cover a multitude of sins, particularly when applied by non-catalogers.

    Other dates may also be recorded in this DC element, such as the date of last update (which might need to be included in a “Description based on:” note). Since Dublin Core includes little information that explicitly addresses the distinction among versions of the same resource, this element may be the only source of such data, and the information may be decidedly inadequate for this purpose.

  8. RESOURCE TYPE

    According to the Dublin Core definition, the TYPE element contains “the category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types.” The USMARC Crosswalk maps this element to a Form/Genre term (USMARC field 655).

    It should be noted that this data is relevant to several AACR2 elements. It is similar to the Designation element in the File characteristics area (AACR2 9.3B1, USMARC 256). If the list of designations in 9.3B1 is expanded as a result of the ISBD(ER) harmonization, it will be important that the DC and AACR2 lists not be in conflict. RESOURCE TYPE data may also be relevant to the note on Nature and scope (AACR2 9.7B1a, USMARC 500).

  9. FORMAT

    According to the Dublin Core definition, the FORMAT element contains “the data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principal, formats can include physical media such as books, serials, or other non-electronic media.” The USMARC Crosswalk maps this element to a subfield in field 856 (Electronic location and access).

    In terms of AACR2, this element may contain data that could be included in a note on Nature and scope (AACR2 9.7B1a, USMARC 516) or on System requirements (AACR2 9.7B1b, USMARC 538). The information in the metadata would have to be rephrased when used in a note.

  10. RESOURCE IDENTIFIER

    According to the Dublin Core definition, the RESOURCE IDENTIFIER element contains a “string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.” The USMARC Crosswalk maps this element to the URL (856$u), although other elements can be used if the appropriate scheme is identified (e.g., ISBN in field 020). Although this is vital information about any Web resource, it is not governed by any AACR2 rules (except 9.8B for the ISBN or ISSN).

  11. SOURCE

    The SOURCE element contains information about “the work, either print or electronic, from which this resource is derived.” Although the USMARC Crosswalk maps this to field 786 (Data source), the data is covered by the note on Edition and history (AACR2 9.7B7, USMARC 500 or 533) and, in the case of serial publications, the note on Other formats (AACR2 12.7B16, USMARC 776). The content of the element may need to be modified to comply with the relevant rules and to assure that the related resource is correctly identified.

  12. LANGUAGE

    The LANGUAGE element corresponds to the Language note (AACR2 9.7B2, USMARC 546), as well as to the coded language element in USMARC. The default mapping is to field 546, on the grounds that free-text information is most likely, but, if the USMARC coded scheme is identified, it can be mapped to the coded element in field 008. The content in the Language note may have to be modified to conform to the rules.

  13. RELATION

    The RELATION element contains data about the relation of the resource to other resources. This is a more general version of the SOURCE relationship and, like SOURCE, corresponds to the Edition and history and the Relationships with other serials notes. Again, the content of the element may need to be modified to comply with the relevant rules and to assure that the related resource is correctly identified.

  14. COVERAGE

    The COVERAGE element contains data about the chronological or geographic coverage of the resource. The default USMARC mapping is to field 500, but the data may be appropriate in a variety of notes. It may also serve as the basis for subject descriptors. Certain data producers and archivists have developed fairly detailed standards for defining the coverage, particularly of geo-spatial data, and there are specific USMARC fields for such data. Only general coverage notes are specified in AACR2, under the rule for notes on Nature and scope.

  15. RIGHTS MANAGEMENT

    According to the Dublin Core definition, “the content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources.” The USMARC Crosswalk maps this element to field 540 (Terms governing use and reproduction note) for which there is no corresponding rule in AACR2.
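The field assignments discussed above can be gathered into a small lookup, sketched here in Python. This is an illustration only, not the Library of Congress crosswalk itself: the function names and the qualifier labels ("personal", "LCSH", and so on) are hypothetical, and elements for which no USMARC field is named above (CONTRIBUTOR, RELATION) are left out. The `title_fields` helper sketches the TITLE rule, using a naive exact-string comparison that a cataloger's judgment would normally replace.

```python
# Default targets as stated in the element-by-element discussion.
DEFAULT_FIELD = {
    "TITLE": "245$a",
    "CREATOR": "720",
    "SUBJECT": "653",
    "DESCRIPTION": "520",
    "PUBLISHER": "260$b",
    "DATE": "260$c",
    "TYPE": "655",
    "FORMAT": "856",
    "IDENTIFIER": "856$u",
    "SOURCE": "786",
    "LANGUAGE": "546",
    "COVERAGE": "500",
    "RIGHTS": "540",
}

# Targets that apply only when a name type or scheme is identified.
# The qualifier spellings are assumptions made for this sketch.
QUALIFIED_FIELD = {
    ("CREATOR", "personal"): "700",
    ("CREATOR", "corporate"): "710",
    ("SUBJECT", "LCSH"): "650",
    ("SUBJECT", "LCC"): "050",
    ("IDENTIFIER", "ISBN"): "020",
    ("LANGUAGE", "USMARC"): "008",
}


def marc_field(element, qualifier=None):
    """Return the USMARC target for a DC element, or None if not covered."""
    element = element.upper()
    return QUALIFIED_FIELD.get((element, qualifier), DEFAULT_FIELD.get(element))


def title_fields(metadata_title, displayed_title=None):
    """Apply the TITLE rule: the eye-readable title takes precedence."""
    if displayed_title is None:
        # No eye-readable title: the metadata may serve as the Title Proper.
        return [("245$a", metadata_title)]
    fields = [("245$a", displayed_title)]
    if metadata_title != displayed_title:
        # Record the differing metadata title as a captioned variant title.
        fields.append(("246", "$iTitle from metadata: $a" + metadata_title))
    return fields
```

For example, `marc_field("subject")` falls back to the uncontrolled field 653, while `marc_field("subject", "LCSH")` yields 650. A lookup of this kind only routes the data; the verification, authority work, and reformatting described for each element would still be done by a cataloger.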

