
2 Texts in the CITE architecture - Reference Documentation

Authors: Neel Smith and Christopher Blackwell

Version: 1.0.beta

2 Texts in the CITE architecture

2.1 CTS URNs

bq. Uniform Resource Names (URNs) are intended to serve as persistent, location-independent, resource identifiers.

Semantics of a CTS URN

CTS URNs refer to a passage of text in terms of two hierarchies. The first hierarchy identifies a text in a model similar to the conceptual model of the Functional Requirements for Bibliographic Records (FRBR). (For an introduction to FRBR, see this basic reading list.) Where the conceptual model of FRBR aims to represent bibliographic entries as they are cataloged by librarians, however, CTS URNs aim to model works as they are cited by scholars.

CTS URNs organize works in text groups. Text groups have no direct parallel in FRBR, and do not have a predefined semantic range. Instead, they associate works, according to traditional citation practice, in groups with various meanings. The text group may reflect authorship (e.g., a work entitled Huckleberry Finn might belong to a group named "Mark Twain"), or may represent some other kind of corpus (e.g., a work numbered 1 belonging to a group named "Federalist Papers"). Within a text group, a CTS URN's work is a conceptual entity, like the FRBR work: it is an abstract idea of the content expressed in all versions of a work, in the original language or in translation. The work may optionally be identified with increasing specificity as versions (translation or edition), or exemplars (individual physical copies). The CTS URN's version corresponds to the "expression" in the FRBR model, while exemplars correspond to "items" in FRBR parlance.

The second hierarchy in a CTS URN refers to a passage expressed in a logical citation scheme. While the nature of this hierarchy depends on the specific work referred to by a CTS URN, many texts will fall into one of a few common citation schemes. Prose works might be cited by chapter and section, or by book, chapter and section; poems might be cited by line, by stanza and line, or by book and line.

Within the smallest citation unit (such as a paragraph or section for a prose work, or line of verse for a poem), CTS URNs can further specify a span of text with a subreference. Subreferences identify indexed substrings, or a range between an indexed pair of substrings. Because subreferences are inherently language-specific, they are only valid when the work identifier is specified to the level of a version (edition or translation), or exemplar.

Resolving CTS URNs

By "resolving CTS URNs," we mean specifically the symmetric problems of how we determine what work a CTS URN refers to, and how we determine what URN values to use to refer to a work. (The further question of how to retrieve a passage of text referred to by a CTS URN is beyond the scope of the CTS URN's location-independent identifiers: see instead the related topic of Canonical Text Services.)

These problems are analogous to resolving internet domain names to numeric addresses, and vice versa, and it is unsurprising that CTS URNs use analogous mechanisms to solve them. Like the internet domain name system, CTS URNs must guarantee that the values used to identify a work are globally unique. Like DNS, CTS URNs achieve this by delegating responsibility for managing authoritative registries. Like a top-level DNS server, CTS URNs depend ultimately on a top-level registry listing what further CTS registries are responsible for specific domains. This top-level registry is housed in the Scaife Digital Library, a durable digital repository. (Further links to SDL will be added here when they are available.)

Just as a university or business can manage domain names within its own domain name space, an organization can manage a registry of canonical identifiers for texts within its domain. So the top-level registry assigns the identifier greekLit to a registry maintained at the Center for Hellenic Studies covering ancient Greek transmitted by manuscript copying. Other registries could be added to cover specific collections of epigraphic or papyrological texts.

Syntax of a CTS URN

URNs always begin with the string urn: followed by a protocol identifier. We use the identifier cts for our protocol.

Colons separate the top-level elements of a CTS URN: any use of a colon as a data value must therefore be escaped. The top-level elements are:

  1. urn name space (required: always cts)
  2. cts namespace (required: a value registered with the Scaife Digital Library list of CTS namespaces)
  3. work identifier (required: a value registered in the designated registry)
  4. passage reference (optional)
  5. subreference (optional)

The general structure of a CTS URN is therefore

urn:cts:CTSNAMESPACE:WORK:PASSAGE:SUBREFERENCE?

Periods separate second-level hierarchical components of the work identifier and passage reference. Within either of those components, any use of a period as a data value must be escaped.

CTS namespace

The work citation must include a namespace prefix registered with the CHS namespace Registry.

Work identifiers

Work identifiers are formatted as dot-separated components representing at least one of

textgroup, work, edition | translation, exemplar

Values must be registered with the registry identified by the CTS namespace component.

Example: The namespace Registry identifies the CHS registry of ancient Greek transmitted by manuscript copying with the namespace greekLit; the CHS registry in turn identifies the textgroup "Homer" with the ID tlg0012 and the work Iliad with the ID tlg001. A URN reference to the Iliad would therefore be expressed as urn:cts:greekLit:tlg0012.tlg001
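As an illustration of the syntax described so far, the following minimal Python sketch splits a CTS URN into its namespace and the components of its work identifier. The names `WorkIdentifier` and `parse_work` are hypothetical, not part of any CTS library, and the sketch assumes that no escaped colons or periods occur in the values.

```python
from typing import NamedTuple, Optional


class WorkIdentifier(NamedTuple):
    """Hypothetical container for the work-level parts of a CTS URN."""
    namespace: str
    textgroup: str
    work: Optional[str] = None
    version: Optional[str] = None    # edition or translation
    exemplar: Optional[str] = None


def parse_work(urn: str) -> WorkIdentifier:
    """Split a CTS URN on colons, then split the work identifier on periods.

    Assumption: no escaped colons or periods appear as data values.
    """
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_id = parts[2], parts[3]
    components = work_id.split(".")
    # textgroup, work, edition | translation, exemplar: one to four components
    if not 1 <= len(components) <= 4 or not all(components):
        raise ValueError(f"malformed work identifier: {work_id!r}")
    return WorkIdentifier(namespace, *components)


iliad = parse_work("urn:cts:greekLit:tlg0012.tlg001")
print(iliad.namespace, iliad.textgroup, iliad.work)
```

Checking that the parsed values are actually registered would require consulting the registry named by the CTS namespace, which this sketch does not attempt.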

Passage citations

Passage citations may refer either to individual passages or to ranges within a work.

A reference to an individual passage is formatted as dot-separated components representing one or more levels of the citation hierarchy defined in a CTS TextInventory for that work.

A reference to a range is formatted as two passage references separated by a hyphen.

CTS 3 accommodates works with multiple citation schemes. Because different parts of a work might have citation schemes with different depths to their citation hierarchy, it is essential to allow ranges to include references to beginning and end points at different depths in different citation schemes. To avoid ambiguity, each of the two passage references in a range expression must be given fully: implicit context, as is commonly used in informal usage, is not permitted. (E.g., while common informal usage allows expressions like "1.10-20" to mean "lines 10-20 of book 1," a CTS URN would require a passage expression like "1.10-1.20".)

Examples: Extending the previous example, a reference to line 10 of book 1 of the Iliad would be urn:cts:greekLit:tlg0012.tlg001:1.10. A reference to lines 10-20 of the same book would be:

urn:cts:greekLit:tlg0012.tlg001:1.10-1.20
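The difference between informal and fully specified ranges can be sketched in a few lines of Python. The helpers `split_passage` and `expand_informal_range` are hypothetical names, and the sketch assumes that hyphens and periods never appear as escaped data values. Because a CTS range may legitimately end at a shallower depth than it begins, the expansion is only appropriate when converting citations known to follow the informal convention.

```python
from typing import List, Optional, Tuple


def split_passage(passage: str) -> Tuple[List[str], Optional[List[str]]]:
    """Split a passage reference into start and end citation values.

    The end is None when the reference is a single passage, not a range.
    """
    start, hyphen, end = passage.partition("-")
    return start.split("."), (end.split(".") if hyphen else None)


def expand_informal_range(passage: str) -> str:
    """Rewrite an informal range such as '1.10-20' as '1.10-1.20'.

    Assumption: a shorter end point borrows its leading values from the start.
    """
    start, end = split_passage(passage)
    if end is None or len(end) >= len(start):
        return passage
    full_end = start[: len(start) - len(end)] + end
    return ".".join(start) + "-" + ".".join(full_end)


print(expand_informal_range("1.10-20"))
```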

Subreferences

Subreferences identify spans within a single citation unit using indexed substrings. See an introduction to the syntax and semantics of CTS URN subreferences.

2.2 Subreferences

Overview

A subreference points to a string of characters within a leaf-level citation node. While CTS URN references are abstract, and apply to any version of a text, subreferences expressed in terms of strings of characters are inherently tied to a specific language. They are only valid on URNs that include work references at the version or exemplar level.

Syntax

Syntactically, substrings are set off from the passage reference they qualify by the pound sign "#" (recalling the use of the same character in URL references to refer to locations within a URL). A subreference may contain two parts: a literal string, and an index value. If an index value is included, it is enclosed in square brackets "[]" and follows any substring. The index value must evaluate to a positive integer.

Semantics

At least one of the two parts of the subreference must be present. If both a substring and an index, n, are included, the reference points to the nth occurrence of the substring in the cited node. If a substring is given, but no index value, then it is taken to mean the first occurrence of the substring in the cited node. If an index is given, but no substring, it is taken to mean the nth code point in the cited node. Index values are 1-origin values.
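These rules can be sketched in Python as follows. The function names `parse_subreference` and `locate` are hypothetical, the sample strings are invented, and the sketch assumes that square brackets never occur inside the literal substring and that overlapping occurrences are counted separately. Python strings are indexed by code point, which matches the index-only case.

```python
import re
from typing import Optional, Tuple

# A (possibly empty) literal substring followed by an optional [index].
_SUBREF = re.compile(r"^(?P<text>[^\[\]]*)(?:\[(?P<index>\d+)\])?$")


def parse_subreference(subref: str) -> Tuple[str, Optional[int]]:
    """Split a subreference into its substring and 1-origin index."""
    match = _SUBREF.match(subref)
    if match is None:
        raise ValueError(f"malformed subreference: {subref!r}")
    text = match.group("text")
    raw_index = match.group("index")
    index = int(raw_index) if raw_index is not None else None
    if not text and index is None:
        raise ValueError("a subreference needs a substring, an index, or both")
    if index is not None and index < 1:
        raise ValueError("index values are 1-origin positive integers")
    return text, index


def locate(node_text: str, subref: str) -> Tuple[int, int]:
    """Return 0-based (start, end) code-point offsets within node_text."""
    text, index = parse_subreference(subref)
    n = 1 if index is None else index
    if not text:
        # Index without substring: the nth code point of the node.
        if n > len(node_text):
            raise ValueError("index is beyond the end of the node")
        return n - 1, n
    position = -1
    for _ in range(n):
        position = node_text.find(text, position + 1)
        if position == -1:
            raise ValueError(f"fewer than {n} occurrences of {text!r}")
    return position, position + len(text)


sample = "the cat and the hat"  # invented text, not a real citation node
start, end = locate(sample, "the[2]")
print(sample[start:end], start, end)
```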

Examples

[1] The following two URNs are equivalent:

urn:cts:greekLit:tlg0012.tlg001.mth-01:1.1#Achilles[1]

urn:cts:greekLit:tlg0012.tlg001.mth-01:1.1#Achilles

In both cases, the reference is to the first occurrence of the string "Achilles" in line 1 of book 1.

[2] A subreference spanning leaf citation nodes:

urn:cts:greekLit:tlg0012.tlg001.mth-01:1.1#Achilles-1.10#Atreus

This identifies a span of text running from the first occurrence of the string "Achilles" in book 1, line 1 of a version of the Iliad, to the first occurrence of the string "Atreus" in book 1, line 10 of the same translation.

[3] Indexed substrings

urn:cts:greekLit:tlg0012.tlg001.mth-01:1.1#Achilles-1.10#the[2]

This identifies a span of text running from the first occurrence of the string "Achilles" in book 1, line 1, to the second occurrence of the string "the" in book 1, line 10 of the specified translation of the Iliad.

[4] Indexed code points

urn:cts:greekLit:tlg0012.tlg001.mth-01:1.1#[4]-1.1#[6]

This URN refers to the fourth through sixth code points (inclusive) of book 1, line 1 of the Iliad, in a specified version. Note that the meaning of this will depend both on the reading of the specific version, and the digital character encoding of the specific version. In particular, for non-ASCII characters in UTF-8, it is worth emphasizing that character data values in a programming language may not be equivalent to Unicode code points in that text.
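The following short Python sketch illustrates why code-point indexes depend on the encoding and normalization of a specific version. It uses the Greek word μῆνιν purely as sample data: the same word has a different number of code points in composed (NFC) and decomposed (NFD) form, and neither count equals its length in UTF-8 bytes.

```python
import unicodedata

# "μῆνιν" written with a precomposed eta-with-perispomeni (U+1FC6).
word = "\u03bc\u1fc6\u03bd\u03b9\u03bd"

composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

print("code points (NFC):", len(composed))
print("code points (NFD):", len(decomposed))
print("UTF-8 bytes (NFC):", len(composed.encode("utf-8")))
print("UTF-16 code units (NFC):", len(composed.encode("utf-16-le")) // 2)

# The "second through third code points" select different text in each form.
print(composed[1:3] == decomposed[1:3])
```

A language whose strings are indexed by bytes or by UTF-16 code units would need to convert those positions to code points before applying an index-only subreference.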

2.3 The Canonical Text Service

Summary

The Canonical Text Services (CTS) are part of the CITE architecture. The CTS specification defines a network service for identifying texts and for retrieving fragments of texts by canonical reference expressed as CTS URNs.

Introduction

See an overview of CTS.

CTS by example

See the live examples from a CTS demo site.

Specification

See the material being added to the reference section of this guide, including Relax NG schemas defining the syntax of replies to CTS requests.

Test suite and CTS validator

A test suite includes a sample data set of texts and a TextInventory file describing them, test requests to apply to the sample data set, and files with valid responses to the test requests. "CTS validator" is a webapp that uses the test suite to measure an installation's compliance with the CTS 3 protocol. More details (with download links).

Code

The reference implementation of CTS version 3 in Groovy/Java and a parallel Python implementation using Google's AppEngine framework are currently being tested.

[ link to source code archive ]

2.4 Overview of the Canonical Text Service

A Brief Guide to the Canonical Text Service

What is CTS?

Canonical Text Services identify and retrieve passages of text cited by canonical reference.

Citations are expressed as CTS URNs. Text passages are structured in XML that can be validated against some schema or DTD.

Where CTS URNs define a permanent notation for citing texts, independent of any technology, Canonical Text Services provide a network service that can equate XML documents with the work referred to by a CTS URN, and can retrieve a well-formed XML fragment for a passage referred to in a CTS URN.

The CTS architecture and design goals

The Canonical Text Services protocol defines interaction between a client and a server program using the HTTP protocol: clients submit requests, with parameters included as HTTP GET parameters; the CTS response is structured in XML validating against the CTS reply schemas. While a user could therefore interact directly with a CTS by pointing a web browser at URLs formed according to the CTS specification, the purpose of the service is to provide services to software that recognizes CTS URNs.

The vocabulary of requests (highlights summarized below) allows a client to discover metadata about the collection of texts served by a specific CTS instance, as well as to retrieve passages of text.

The server's metadata catalog, called a "text inventory," identifies a means (such as a Relax NG schema) for validating the XML realization of a document, and describes how the canonical citation scheme of the CTS URN maps on to the XML representation.

Version 3 of CTS introduced three important changes. First, in CTS 3, documents may validate against any standard method chosen by the service's administrator, such as Relax NG schemas, XML schemas, or DTDs. As part of this change, CTS 3 now supports XML namespaces. Second, different parts of a document may be cited using different citation schemes. (E.g., a preface might be cited differently from the main body of a work.) Third, an optional extension that implementations may choose either to support or ignore deals with the topological relation of URNs. (For more information, see URN topology.)

Interacting with a CTS: the principal requests

Programs (and the programmers who write them) can interact with a CTS using any of the nine defined requests. The request name is always included in an HTTP parameter named request; for all requests except the metadata request GetCapabilities, a CTS URN is always included in an HTTP parameter named urn. Consider this possible series of exchanges between a client program interested in hexameter poetry, and a CTS at the address http://machine/service.
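A client might assemble such request URLs with a small helper like the following Python sketch. The function name `cts_request_url` is hypothetical, and http://machine/service is the placeholder address used in this guide. The standard library's `urlencode` percent-encodes unsafe characters; colons are kept literal here only so that the URNs stay readable.

```python
from typing import Optional
from urllib.parse import urlencode


def cts_request_url(base: str, request: str, urn: Optional[str] = None, **extra) -> str:
    """Build a CTS request URL with its parameters as HTTP GET parameters."""
    params = {"request": request}
    if urn is not None:
        params["urn"] = urn
    params.update(extra)
    # safe=":" leaves colons unescaped; other unsafe characters are encoded.
    return base + "?" + urlencode(params, safe=":")


base = "http://machine/service"
print(cts_request_url(base, "GetCapabilities"))
print(cts_request_url(base, "GetValidReff", "urn:cts:greekLit:tlg0012.tlg001", level=1))
```

With this approach a "#" introducing a subreference is sent as %23, so that it is not mistaken for a URL fragment.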

GetCapabilities: What texts does the service include, and how do I cite them?

http://machine/service?request=GetCapabilities

The GetCapabilities request takes no further parameters. The reply includes the complete TextInventory, or metadata catalog, for the service. From this information, a client can determine everything the service has to offer: what texts are online, what their citation scheme looks like, whether the service supports optional features such as URN topology. (For more on the information included in a TextInventory, see below.) The following entry for an edition of the Homeric Hymn to Athena includes the information that the Homeric Hymns are text group tlg0013 in the greekLit CTS namespace, and that tlg011 is the short Homeric Hymn to Athena. (We could therefore identify this work succinctly with the CTS URN urn:cts:greekLit:tlg0013.tlg011.) It further tells us that the Hymn to Athena is cited by poetic line, and that citation values for poetic lines are encoded on the n attribute of the TEI schema's l element. But how do we determine what line numbers are valid references? For that, we can use the GetValidReff request.

<textgroup projid="greekLit:tlg0013">
  <groupname xml:lang="eng">Homeric Hymns</groupname>
  <work xml:lang="grc-c" projid="greekLit:tlg011">
    <title xml:lang="eng">Hymn to Athena</title>
    <edition label="chs" projid="greekLit:chs01">
      <online docname="tlg0013/tlg0013.tlg011.xml" srcid="OCT">
        <validate schema="http://katoptron.holycross.edu/schemas/teip5/teip5core.rng"/>
        <citationScheme canonical="yes" schemaId="poeticline"/>
        <citationMapping defaultNSAbbr="tei">
          <citation scope="/TEI/text/body" xpath="/l[@n = '?']" label="line"/>
        </citationMapping>
      </online>
    </edition>
  </work>
</textgroup>
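A client could derive work-level URNs from such an entry along the lines of the following Python sketch, which uses the standard library's ElementTree. The function name `work_urns` is hypothetical, the embedded entry is abridged from the example above, and the sketch assumes un-namespaced element names as shown there; a real TextInventory may declare an XML namespace that the lookups would have to take into account.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample entry (validation and xpath details omitted).
ENTRY = """
<textgroup projid="greekLit:tlg0013">
  <groupname xml:lang="eng">Homeric Hymns</groupname>
  <work xml:lang="grc-c" projid="greekLit:tlg011">
    <title xml:lang="eng">Hymn to Athena</title>
    <edition label="chs" projid="greekLit:chs01">
      <online docname="tlg0013/tlg0013.tlg011.xml" srcid="OCT">
        <citationScheme canonical="yes" schemaId="poeticline"/>
        <citationMapping defaultNSAbbr="tei">
          <citation scope="/TEI/text/body" label="line"/>
        </citationMapping>
      </online>
    </edition>
  </work>
</textgroup>
"""


def work_urns(entry_xml: str):
    """List (urn, title, citation labels) for each work in a textgroup entry."""
    group = ET.fromstring(entry_xml)
    # projid values look like "namespace:identifier".
    namespace, group_id = group.get("projid").split(":", 1)
    found = []
    for work in group.findall("work"):
        work_id = work.get("projid").split(":", 1)[1]
        labels = [citation.get("label") for citation in work.iter("citation")]
        urn = f"urn:cts:{namespace}:{group_id}.{work_id}"
        found.append((urn, work.findtext("title"), labels))
    return found


for urn, title, labels in work_urns(ENTRY):
    print(urn, title, labels)
```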

GetValidReff: What citation values are valid?

http://machine/service?request=GetValidReff&urn=urn:cts:greekLit:tlg0013.tlg011

The urn parameter to this request identifies the Homeric Hymn to Athena. The body of the reply includes a complete list of every CTS URN that is valid for this very short poem, in the order in which they appear in the text, and could look like this:

<reply>
  <reff>
    <urn>urn:cts:greekLit:tlg0013.tlg011:1</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:2</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:3</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:4</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:5</urn>
  </reff>
</reply>
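Reading such a reply is straightforward. The Python sketch below, with the hypothetical function name `valid_reffs`, collects the URNs in document order using ElementTree. It assumes the simplified, un-namespaced body shown above; a real reply validating against the CTS reply schemas may carry additional wrapper elements or an XML namespace.

```python
import xml.etree.ElementTree as ET

REPLY = """
<reply>
  <reff>
    <urn>urn:cts:greekLit:tlg0013.tlg011:1</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:2</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:3</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:4</urn>
    <urn>urn:cts:greekLit:tlg0013.tlg011:5</urn>
  </reff>
</reply>
"""


def valid_reffs(reply_xml: str):
    """Return the URNs of a GetValidReff reply in document order."""
    root = ET.fromstring(reply_xml)
    return [element.text.strip() for element in root.iter("urn")]


reffs = valid_reffs(REPLY)
# The passage component is whatever follows the last colon.
print([urn.rsplit(":", 1)[1] for urn in reffs])
```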

Optionally, GetValidReff requests may include a level parameter, defining the depth of the citation scheme to consider. For a work with a single level of citation, such as a poem cited by lines, that option is irrelevant, but if we wanted to discover valid references for books of the Iliad (rather than lines) included in a CTS, we could submit a GetValidReff request with a value of 1 for the level parameter. If our GetCapabilities reply tells us that the Iliad is work tlg001 in text group tlg0012 in the greekLit namespace, the request would be:

http://machine/service?request=GetValidReff&urn=urn:cts:greekLit:tlg0012.tlg001&level=1

The reply would include only 24 URNs (one for each book of the Iliad), resolved only to the first level (books) of the citation hierarchy, not the second level of individual lines.

If we subsequently wanted to discover what line numbers are valid within book 10 of the Iliad, we could submit a urn limited to that book:

http://machine/service?request=GetValidReff&urn=urn:cts:greekLit:tlg0012.tlg001:10

GetPassage: What is the text of this passage?

http://machine/service?request=GetPassage&urn=urn:cts:greekLit:tlg0013.tlg011:1

Applications might choose to batch process and store metadata about texts, and even lists of valid reference values, but the heart of the interaction between a CTS and client programs is retrieving passages of text for a given URN. The body of the reply contains a well-formed XML fragment with the requested passage of text framed by all its parent elements. The sample request above asks for line 1 of the Homeric Hymn to Athena; the body of a reply could look like this if the text were marked up in TEI-conformant XML:

<reply>
  <TEI>
    <text>
      <body>
        <l n="1">Παλλάδ' Ἀθηναίην ἐρυσίπτολιν ἄρχομ' ἀείδειν</l>
      </body>
    </text>
  </TEI>
</reply>
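Because the passage arrives framed by its parent elements, a client can recover both the text and its structural context. The following Python sketch, with the hypothetical function name `leaves_with_context`, walks the reply body and reports each leaf element together with the path of elements that contain it. It assumes the simplified, un-namespaced reply shown above.

```python
import xml.etree.ElementTree as ET

REPLY = """
<reply>
  <TEI>
    <text>
      <body>
        <l n="1">Παλλάδ' Ἀθηναίην ἐρυσίπτολιν ἄρχομ' ἀείδειν</l>
      </body>
    </text>
  </TEI>
</reply>
"""


def leaves_with_context(element, path=()):
    """Yield (element path, n attribute, text) for every leaf below element."""
    here = path + (element.tag,)
    children = list(element)
    if not children:
        yield "/".join(here), element.get("n"), (element.text or "").strip()
    for child in children:
        yield from leaves_with_context(child, here)


root = ET.fromstring(REPLY)
# Skip the <reply> wrapper itself and start from the passage's own root.
passages = [leaf for child in root for leaf in leaves_with_context(child)]
for path, n, text in passages:
    print(path, n, text)
```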

GetPrevNextUrn: What is the following (or preceding) passage?

http://machine/service?request=GetPrevNextUrn&urn=urn:cts:greekLit:tlg0013.tlg011:2

The string making up the reference component of a URN is arbitrary (e.g., it is perfectly legitimate for a line labelled "320" to precede a line labelled "319"), but URNs have an inherent order: the document order of the text units they refer to. While applications can parse the results of a GetValidReff to determine what URNs precede or follow a given URN, it is also possible to request this information directly. The example asks for the URNs preceding and following line 2 of the Homeric Hymn to Athena. The body of the reply would be:

<reply>
  <prevnext>
    <prev>urn:cts:greekLit:tlg0013.tlg011:1</prev>
    <next>urn:cts:greekLit:tlg0013.tlg011:3</next>
  </prevnext>
</reply>
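The client-side alternative mentioned above, working from the ordered list returned by GetValidReff, can be sketched in Python as follows. The function name `prev_next` is hypothetical. The lookup relies only on position in the list, never on the reference strings themselves, so it also handles a line labelled "320" that precedes a line labelled "319".

```python
from typing import List, Optional, Tuple


def prev_next(reffs: List[str], urn: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the URNs before and after urn in document order.

    reffs is assumed to be the ordered URN list from a GetValidReff reply.
    None marks the beginning or end of the text.
    """
    position = reffs.index(urn)  # raises ValueError if urn is not valid
    previous = reffs[position - 1] if position > 0 else None
    following = reffs[position + 1] if position + 1 < len(reffs) else None
    return previous, following


hymn = [f"urn:cts:greekLit:tlg0013.tlg011:{n}" for n in range(1, 6)]
print(prev_next(hymn, "urn:cts:greekLit:tlg0013.tlg011:2"))
```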

GetPassagePlus: Can we simplify this exchange?

Applications supporting navigation of a text regularly need to submit GetPassage and GetPrevNextUrn in tandem. To simplify this (and cut in half the number of client/server round trips needed to navigate a text), the GetPassagePlus request works exactly like the GetPassage request, except that it packages in the reply both the XML of the requested passage, and the prevnext element of a GetPrevNextUrn request.

Managing a CTS: the TextInventory

A CTS implementation might manage the service's metadata in any way it chooses. It might store the data in a database with a form-based user interface, for example. But the metadata is presented to client applications as XML validating against the CTS TextInventory schema, so we will survey the main components of the TextInventory as they appear serialized to XML.

The TextInventory includes three main parts:

  1. a list of standard citation schemes
  2. a list of the individual TextGroups, Works, Editions, Translations, and Exemplars of documents known to the server
  3. a list of organizational units called Collections

The list of groups, works, etc., is a hierarchical organization used to identify works uniquely, according to some familiar, well-established convention. The collections, on the other hand, allow the administrator of a CTS to group sets of works together for any purpose.

Of these three sections, the most important is the list of groups and works. It is organized as follows.

The Text Inventory: Groups and Works

The list of works contains a list of…

  • …one or more TextGroup elements (e.g. “Homer,” “Aristotle”, “inscriptions from a given site”).

Textgroups are traditional, convenient groupings of texts such as “authors” for literary works, or corpus collections for epigraphic or papyrological texts. Each TextGroup has a unique identifier, one or more titles (allowing titles in different languages), and consists of…

  • …one or more Work elements (e.g. “Iliad,” “Ἀθηναίων Πολιτεία”)

Works are notional entities, each with an identifier unique within this TextGroup. Each work includes one or more titles, and, optionally, may be instantiated in…

  • …zero or more Edition elements and/or Translation elements

Editions and translations are specific versions of a notional work that may be represented by multiple physical copies. Each has an identifier unique within the Work. The TextInventory may here list bibliographic information, since the Canonical Text Services protocol allows editors to work with information about texts that are online and texts that are not. Further, an Edition or Translation may optionally contain…

  • …zero or more Exemplar elements.

Exemplars are specific physical copies of an Edition or Translation. Each has an identifier unique within its containing Edition or Translation. Documenting individual exemplars can be particularly important for early print editions, but would also allow an epigraphic editor the option of treating multiple copies of a single inscription as exemplars of an edition.

If the server can deliver an electronic version at the level of the Edition element, the Translation element or one of their Exemplars, that element will contain…

  • …one Online element

The Online element contains information about the citation scheme of that electronic text. (See details below.) It also includes information that a server implementation could use to translate the abstract reference into terms used for local retrieval, such as a filename or database lookup.

So, for example, a TextInventory entry for the Homeric Hymn to Athena could contain the following information:

TextGroup: tlg0013 (Homeric Hymns)
  Work: tlg011 (Hymn to Athena)
    Edition: chs01 (CHS electronic edition)
      Online: local document reference = tlg0013/tlg0013.tlg011.chs02.xml
    Translation: chs02 (English translation by H. Evelyn-White, now in the public domain)

Each Online element, whether it belongs to an edition, translation, or exemplar, contains three elements: one identifies how the XML document can be validated, a second identifies the citation scheme with an identifier from the list of citation schemes used in this service, and a third element contains a recursive list of citation elements mapping each level of the citation scheme to part of the XML document.

Our example of the Homeric Hymn to Athena cites by a single level, the poetic line, and could be documented like this:

<online docname="tlg0013/tlg0013.tlg011.chs01.xml" srcid="OCT">
  <validate schema="http://katoptron.holycross.edu/schemas/teip5/teip5core.rng"/>
  <citationScheme schemaId="poeticline" canonical="yes"/>
  <citationMapping defaultNSAbbr="tei">
    <citation label="line" xpath="/l[@n = '?']" scope="/TEI/text/body"/>
  </citationMapping>
</online>

An online element for the two-tiered citation of the Iliad illustrates the usage of the citation element's scope and xpath attributes. Each provides a template for an XPath expression, in which question marks (?) can be replaced by the value of one level of a citation. The xpath attribute identifies an XML unit corresponding to a level of the citation scheme; the scope attribute identifies a context in the document where this xpath applies. (The two are distinct because a document's markup might include markup between levels of the citation scheme.)

<online docname="tlg0012/tlg0012.tlg001.hmt-msA.xml" srcid="OCT">
  <validate schema="http://katoptron.holycross.edu/schemas/teip5/teip5core.rng"/>
  <citationScheme schemaId="bookAndPoeticline" canonical="yes"/>
  <citationMapping defaultNSAbbr="tei">
    <citation label="book" xpath="/div[@type = 'book' and @n = '?']" scope="/TEI/text/body">
      <citation label="line" xpath="/l[@n = '?']" scope="/TEI/text/body/div[@type = 'book' and @n = '?']"/>
    </citation>
  </citationMapping>
</online>
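One way a server implementation might use these templates is sketched below in Python. The function names `fill` and `xpath_for` are hypothetical, and the sketch rests on two assumptions that the text above does not state outright: that the templates are XPath expressions with bracketed predicates, and that the full path for a citation is the filled-in scope followed by the filled-in xpath. Citation values are inserted verbatim, so values containing quotation marks would need extra care.

```python
from typing import List, Tuple


def fill(template: str, values: List[str]) -> str:
    """Replace each '?' in template, left to right, with the next value."""
    pieces = template.split("?")
    if len(pieces) - 1 != len(values):
        raise ValueError(f"{template!r} needs {len(pieces) - 1} value(s)")
    result = pieces[0]
    for value, piece in zip(values, pieces[1:]):
        result += value + piece
    return result


def xpath_for(levels: List[Tuple[str, str]], citation: List[str]) -> str:
    """Build an XPath for a citation from (scope, xpath) template pairs.

    levels lists one pair per citation level, outermost level first.
    """
    depth = len(citation)
    if not 1 <= depth <= len(levels):
        raise ValueError("citation depth does not match the citation scheme")
    scope, xpath = levels[depth - 1]
    # The scope consumes the enclosing levels; the xpath consumes the last one.
    return fill(scope, citation[: depth - 1]) + fill(xpath, citation[depth - 1 :])


# Assumed templates for a two-level book-and-line scheme.
ILIAD_LEVELS = [
    ("/TEI/text/body", "/div[@type = 'book' and @n = '?']"),
    ("/TEI/text/body/div[@type = 'book' and @n = '?']", "/l[@n = '?']"),
]

print(xpath_for(ILIAD_LEVELS, ["1", "10"]))
```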

Further information

More detailed information about version 3 of CTS is currently in preparation; links will be posted here when it is made available from the project's sourceforge site.

2.5 The CTS Test Suite

The CTS test suite.

[download link]

2.6 URN Topology

urn topology