About Foreign Language Searching and the CAT

What is Unicode?

Unicode is a standard for encoding characters for use by computers.  It includes virtually every character in every language and script, as well as symbols (e.g., numbers, punctuation, mathematical and musical symbols). Use of Unicode characters “supports the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world.” For more information, see http://www.unicode.org/
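
As a rough illustration of what "encoding characters" means, the Python sketch below prints the Unicode code point, official character name, and UTF-8 bytes for a few characters. The helper name `describe` is made up for this example and has nothing to do with how The CAT itself is implemented.

```python
import unicodedata

def describe(ch: str):
    """Return (code point, Unicode name, UTF-8 bytes) for one character.

    Hypothetical helper for illustration only.
    """
    code_point = f"U+{ord(ch):04X}"   # e.g. U+0416
    name = unicodedata.name(ch)       # official Unicode character name
    utf8 = ch.encode("utf-8")         # bytes as stored or sent over a network
    return code_point, name, utf8

if __name__ == "__main__":
    for ch in ["ñ", "Ж", "中"]:
        print(ch, *describe(ch))
```

Each character, whatever its script, has exactly one code point; that shared numbering is what lets a single catalog record mix Latin, Cyrillic, and Chinese text.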

Why support Unicode in The CAT?

The collections of the Penn State University Libraries contain diverse materials in a wide variety of languages and scripts. Penn State faculty, staff, and students expect to see these languages correctly represented in our catalog, and to be able to search in languages and scripts appropriate to their research and instructional needs.

Warning: The amount of Unicode data in The CAT is currently rather limited.

Unicode is a relatively new standard, and the data in The CAT was created over decades. At this time, the amount of Unicode data is rather limited, even for relevant materials. For example, although the University Libraries has strong collections in Russian and other Slavic languages, the descriptions of almost all of this material exist only in transliteration; there is very little information in the Cyrillic alphabet. On the other hand, the descriptions of materials in Chinese, Japanese, and Korean typically do include information in these scripts. Even so, users should be aware that a search using a non-Latin script will not retrieve all relevant material.

How can I see Unicode characters in The CAT?

On most computers, Unicode characters should display correctly without any additional setup. If you encounter problems, you may need to install a Unicode font on your computer. Windows computers will need the Arial Unicode MS font, and Macintosh computers will need Lucida Grande. For more information, see the Library of Congress page "Displaying and Searching Diacritics and Special Characters."

For non-Latin scripts, the transliterated and non-Latin information will display on adjacent lines for each data element; the full description in both Latin and non-Latin scripts appears on the “more” tab of the “details” display.

How can I search The CAT using Unicode characters?
Searching with diacritics

Characters with diacritics (å, ë, í, ñ, etc.) or other special characters can be searched as regular characters (i.e., without the diacritics). A search with or without diacritics should produce the same result. For certain special characters, you can type a regular character as a substitute (e.g., a regular L instead of a Polish Ł, a regular O instead of a Scandinavian Ø, a regular number instead of a superscript or subscript number).
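
The matching behavior described above can be sketched in Python as follows. This is only an illustration of the general technique (Unicode compatibility decomposition plus a small substitution table), not The CAT's actual implementation; the function name `fold_for_search` and the `SUBSTITUTES` table are assumptions made for this example. It also lowercases the text so that case does not affect the comparison.

```python
import unicodedata

# Hypothetical substitution table: these letters have no Unicode
# decomposition, so they must be mapped to plain letters explicitly.
SUBSTITUTES = str.maketrans({"Ł": "L", "ł": "l", "Ø": "O", "ø": "o"})

def fold_for_search(text: str) -> str:
    """Reduce text to a diacritic-free, lowercase form for comparison."""
    text = text.translate(SUBSTITUTES)
    # NFKD splits "é" into "e" + combining accent and maps
    # superscript/subscript digits to regular digits.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

if __name__ == "__main__":
    for term in ["Łódź", "Søren", "Español", "H₂O"]:
        print(term, "->", fold_for_search(term))
```

If both the search term and the catalog data are folded this way before they are compared, a search for "Lodz" and a search for "Łódź" match the same records.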

Searching with non-Latin scripts

To search using non-Latin scripts, you can enter characters either by pasting them from another application or using the keyboard for the appropriate language. For more information about installing and using keyboards for other languages, please visit the ITS page "Typing with Non-English Keyboards."

For Arabic or Hebrew scripts, your PC will need to be configured to support right-to-left input, in addition to having the appropriate keyboards installed.
