Indexes

Processors can generate indexes from the content of indexing elements.

Index overview

DITA provides several elements to enable indexing. Whether and how an index is rendered will vary based on implementation decisions and rendering formats.

Here are some definitions:

  • An index is a mapping from <indexterm> elements to locations in the DITA content.
  • A generated index is a mapping of index terms to rendered locations.

While DITA provides several elements that support indexing, how those elements are used will vary by implementation.

  • A publishing format like PDF might use a back-of-the-book style index with page numbers, which typically involves merging index elements and generating page numbers.
  • Another publishing format might have no rendered index, but it would instead use the content of index elements to help weight search results.
  • Some implementations might choose to supplement a generated index with additional content, such as treating a specialized <keyword> element as both normal content and an index entry.
  • Implementations might have different ways to render indexing edge cases, based on either implementation capabilities or style preferences.

While DITA defines markup for indexing and specifies exactly the point to which an <indexterm> refers, it cannot force DITA documents to use consistent patterns that work for all formats. Implementations should consider edge cases and how to treat them.

The following list includes some of the conditions that implementations might need to consider when generating an index:

  • Index processors typically ignore leading and trailing whitespace characters.
  • Processors might want to treat two entries separately if they are defined with different capitalization.
  • Processors need to determine how to handle nested markup, such as a <keyword> element that is located within an <indexterm> element.
  • Because <index-see> is used to refer to a term that is used instead of the current entry, processors should consider how to handle a case where an index term is used both as a page locator and with an <index-see> element for redirection.
  • Similarly, processors should consider how to handle the case where an index term is defined with both an <index-see> and an <index-see-also> element.
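
To make the last two conditions concrete, here is a minimal sketch of markup that triggers them. The terms and topic content are invented for illustration; the outcome is not specified here, so each processor has to choose how to reconcile the conflicting entries.

<body>
  <!-- "laptop" is used as a redirection to another entry -->
  <p><indexterm>laptop<index-see>notebook computer</index-see></indexterm>Notebook computers are issued to all staff.</p>
  <!-- Elsewhere, "laptop" is also used as an ordinary entry that expects a locator -->
  <p><indexterm>laptop</indexterm>Return your laptop when you leave the company.</p>
  <!-- One entry that carries both a see and a see-also reference -->
  <p><indexterm>tablet<index-see>mobile device</index-see><index-see-also>stylus</index-see-also></indexterm>Tablets are available on request.</p>
</body>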

Index elements

The contents of <indexterm> elements provide the text for the entries in an index. <indexterm> elements can be nested to create additional levels of indexing, such as secondary and tertiary index entries.
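
As an illustrative sketch (the terms are invented for this example), the following markup nests <indexterm> elements to create a primary entry "cheese" with the secondary entries "cheddar" and "gouda". How the levels are rendered depends on the processor.

<p><indexterm>cheese<indexterm>cheddar</indexterm><indexterm>gouda</indexterm></indexterm>Cheddar and gouda are both aged cheeses.</p>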

The following elements contain information that processors can use to generate indexes:

<indexterm>
Defines a term that can contribute to an index. Matching values of @start and @end attributes on <indexterm> elements can specify an index range.
<index-see>
Defines a term to use as a see reference. See references direct a reader to the preferred term.
<index-see-also>
Defines a term to use as a see also reference. See also references direct a reader to an alternate index entry for additional information.

How the index elements are combined, the location of <indexterm> elements, and the hierarchy of the DITA maps all affect how the index elements are processed and the index entries that are generated.

Location of <indexterm> elements

<indexterm> elements can occur in topic prologs, anywhere else in DITA topics, and in DITA maps.

The location of an <indexterm> element determines the point in the document that the element references.

Topic prologs
An <indexterm> element that is located in a topic prolog is a point reference to the title of the topic. If an <indexterm> element has an @end attribute, it is a point reference to the end of the topic, including any sub-topics.
Anywhere else in a DITA topic
An <indexterm> element that is located in a topic (and not in the topic prolog) is a point reference to the location where the <indexterm> element occurs.
DITA maps
An <indexterm> element that is contained within a <topicref> element is a point reference to the title of the topic. If an <indexterm> element has an @end attribute, it is a point reference to the end of the branch that is specified by the topic reference. If the topic reference is not bound to a resource, the <indexterm> element has no stated purpose.
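
The following sketch contrasts the first two locations in a single topic. The topic ID, title, and terms are hypothetical.

<topic id="backups">
  <title>Backing up data</title>
  <prolog>
    <metadata>
      <keywords>
        <!-- In the prolog: point reference to the topic title -->
        <indexterm>backups</indexterm>
      </keywords>
    </metadata>
  </prolog>
  <body>
    <p>Schedule a full backup every week.</p>
    <!-- In the body: point reference to this position, at the start of the second paragraph -->
    <p><indexterm>restoring data</indexterm>Restore data from the most recent backup.</p>
  </body>
</topic>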

Index locators

An <indexterm> element binds the content of the element, typically a term, to a specific location in a document.

The nesting of <indexterm> elements and the presence of <index-see> elements determine whether locators are rendered in generated indexes:

  • An <indexterm> element that does not contain child <indexterm> elements (or an <index-see> element) contributes a locator to the generated index entry.
  • An <indexterm> element that contains child <indexterm> elements contributes to the hierarchy of the multilevel index entry that is generated. Only a leaf <indexterm> element contributes a locator to the generated index entry. A leaf <indexterm> element is an <indexterm> element that does not contain any other <indexterm> elements.
  • If an <indexterm> element also contains one or more <index-see> elements, no locator is included in the generated index entry.
  • If an <indexterm> element also contains one or more <index-see-also> elements, the <indexterm> element contributes a locator to the generated index entry, and the <index-see-also> element provides only a redirection.
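
As a minimal sketch of the first two rules, with invented terms, the comments below note which elements would contribute a locator:

<body>
  <!-- Leaf element with no children: contributes a locator for "printers" -->
  <p><indexterm>printers</indexterm>Printers are located on every floor.</p>
  <!-- "troubleshooting" only contributes hierarchy; the leaf "paper jams" contributes the locator -->
  <p><indexterm>troubleshooting<indexterm>paper jams</indexterm></indexterm>Open the rear tray to clear a paper jam.</p>
</body>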

Index redirection

The <index-see> and <index-see-also> elements enable redirection to other index entries within a generated index.

The <index-see> element contains text for an index entry that the reader should use instead of the current one, whereas the <index-see-also> element contains text for an index entry that the reader should use in addition to the current one.
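
For illustration, the following markup uses invented terms to show both kinds of redirection. A generated index might render the first entry as "automobile, see car" and the second as "car" with a locator followed by "see also truck", although the exact presentation is up to the processor.

<body>
  <!-- See reference: readers are sent to "car" instead of "automobile" -->
  <p><indexterm>automobile<index-see>car</index-see></indexterm>Company cars are serviced yearly.</p>
  <!-- See-also reference: "car" keeps its own entry and also points to "truck" -->
  <p><indexterm>car<index-see-also>truck</index-see-also></indexterm>Cars can be reserved online.</p>
</body>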

Index ranges

Authors can use the @start and @end attributes on a pair of <indexterm> elements to index an extended discussion. The generated index entry reflects the span between the two <indexterm> elements.

The start of an index range is indicated by an <indexterm> element with a @start attribute. This is called a start-of-range element.

The end of an index range is indicated by an <indexterm> element with an @end attribute with a value that matches the @start attribute on the start element. This is called an end-of-range element. An end-of-range element should contain no content or nested elements.

The start-of-range and end-of-range elements must be leaf <indexterm> elements. If part of a multilevel index entry, the start-of-range and end-of-range elements must be at the same level of the hierarchy.
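
The following sketch shows one reading of this requirement for a multilevel entry. The identifier acct-audits and the terms are hypothetical. Both range elements are leaf elements nested at the second level, under the primary entry "accounting".

<body>
  <!-- Start-of-range: leaf element at the second level -->
  <p><indexterm>accounting<indexterm start="acct-audits">audits</indexterm></indexterm>Audits take place every quarter.</p>
  <p>Prepare the ledgers in advance.</p>
  <!-- End-of-range: empty leaf element at the same level, with a matching identifier -->
  <p><indexterm>accounting<indexterm end="acct-audits"/></indexterm>Unscheduled audits are rare.</p>
</body>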

The location of the <indexterm> elements determines how the range is defined:

Topic body
The start-of-range and end-of-range elements are in the body of the same DITA topic. The range is defined as the span between the two point references in the DITA topic. If an end-of-range element does not exist within the same topic body, the start-of-range element is treated as a point reference rather than as the start of a range.
Topic prolog
The start-of-range and end-of-range elements are in the prolog of the same DITA topic. The range is defined as being between the title of the DITA topic and the end of the last nested topic. If an end-of-range element does not exist within the topic prolog, the start-of-range element is treated as a point reference rather than as the start of a range.
DITA map
The start-of-range and end-of-range elements are contained within topic references in the same DITA map. If an end-of-range element does not exist within the same map, the start-of-range element is treated as a point reference rather than as the start of a range.

Processors that support index ranges SHOULD do the following:
  • Match @start and @end attributes by a character-by-character comparison with all characters significant and no case folding occurring.
  • Ignore @start and @end attributes if they occur on an <indexterm> element that has child <indexterm> elements.
  • Handle an end-of-range <indexterm> element that is nested within one or more <indexterm> elements. The end-of-range <indexterm> element should have no content of its own; if it contains content, that content is ignored.
  • Determine the effective range of overlapping index ranges that have the same identifier by matching the earliest start-of-range element from the set of overlapping ranges with the latest end-of-range element from the set of overlapping ranges.
  • Treat an unmatched start-of-range element as a simple <indexterm> element.
  • Ignore unmatched end-of-range <indexterm> elements.
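
As a sketch of how these recommendations might play out, consider the following hypothetical topic. The two "safety" ranges overlap, so the effective range runs from the first paragraph to the start of the fourth. The identifiers "Gloves" and "gloves" differ in case and therefore do not match: the start-of-range element is treated as a simple <indexterm> element, and the end-of-range element is ignored.

<topic id="lab-safety">
  <title>Lab safety</title>
  <body>
    <!-- Two overlapping ranges share the identifier "safety" -->
    <p><indexterm start="safety">safety</indexterm>Wear goggles at all times.</p>
    <p><indexterm start="safety">safety</indexterm>Tie back long hair.</p>
    <p><indexterm end="safety"/>Wash your hands before leaving.</p>
    <p><indexterm end="safety"/>Report every spill.</p>
    <!-- No case folding: "Gloves" and "gloves" are different identifiers -->
    <p><indexterm start="Gloves">gloves</indexterm>Replace torn gloves immediately.</p>
    <p><indexterm end="gloves"/>Dispose of used gloves in the marked bin.</p>
  </body>
</topic>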

Index sorting

The combination of an <indexterm> and a <sort-as> element specifies a sort phrase under which an index entry is grouped or sorted.

This gives an author the flexibility to group or sort an index entry differently from how its text would normally be grouped or sorted. A common use is to disregard insignificant leading text, such as punctuation or words like "the" or "a". For example, the author might want <data> to be sorted under the letter D rather than the left angle bracket (<). An author might want to include such an entry under both the punctuation heading and the letter D, in which case there can be two index entries differentiated only by the sort-as value.

Certain languages have special sort order needs. For example, Japanese index entries might be written partially or wholly in kanji, but need to be sorted in phonetic order according to their hiragana/katakana rendition. There is no reliable automated way to map written to phonetic text: for kanji text, there can be multiple phonetic possibilities depending on the context. The only way to correctly sort Japanese index entries is to keep the phonetic counterparts with the written forms. The phonetic text would be presented as the sort-as value for indexing purposes.
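
The two cases above can be sketched as follows. The placement of <sort-as> after the entry text and the surrounding paragraph content are assumptions made for this illustration. The first entry displays as "<data>" but is sorted as "data"; the second displays the kanji for Tokyo and is sorted by its hiragana reading.

<body>
  <!-- Displayed with angle brackets, sorted under the letter D -->
  <p><indexterm>&lt;data&gt;<sort-as>data</sort-as></indexterm>The data element holds machine-readable values.</p>
  <!-- Displayed in kanji, sorted by the phonetic (hiragana) form -->
  <p><indexterm>東京<sort-as>とうきょう</sort-as></indexterm>The head office is in Tokyo.</p>
</body>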

Examples of indexing

This section is non-normative.

This section contains examples and scenarios that illustrate the use and processing of indexing elements.

Example: Index range defined in a single topic

This section is non-normative.

In this scenario, an index range is defined directly in the body of a topic.

In the following code sample, the index range begins at the start of the second paragraph and continues to the beginning of the last paragraph.

<topic id="accounting">
  <title>Accounting regulations</title>
  <body>
    <p>Be ethical in your accounting.</p>
    <p><indexterm start="acctrules">rules</indexterm>Remember to do all of the following: ...</p>
    <!-- ...pages worth of rules... -->
    <p><indexterm end="acctrules"/>Failure to comply will get you audited.</p>
  </body>
  <!-- Potential sub-topics -->
</topic>

Example: Index range defined in a topic prolog

This section is non-normative.

In this scenario, an index range is defined in the topic prolog. Ranges defined in a prolog cover subtopics, including those nested based on a map.

Specifying an index range in a topic prolog is useful for defining an index range that contains a topic and its children.

Consider the following DITA map, which contains topics about a small company's operating procedures. The map contains a topic about accounting (acct.dita), which has two child topics: procedures.dita and forms.dita.

<map>
  <title>Company procedures</title>
  <topicref href="acct.dita">
    <topicref href="procedures.dita"/>
    <topicref href="forms.dita"/>
  </topicref>
  <!-- ... -->
</map>

The information developer wants an index entry that will span acct.dita and its children. They use the following markup in acct.dita:

<topic id="accounting-at-acme">
  <title>Accounting at Acme</title>
  <prolog>
    <metadata>
      <keywords>
        <indexterm start="acct">accounting</indexterm>
        <indexterm end="acct"/>
      </keywords>
    </metadata>
  </prolog>
  <!-- ... -->
</topic>

This markup specifies that the index range begins with the start of the topic title, and the end of the range is the end of the forms.dita topic. The index range includes the "Accounting at Acme" topic and its two child topics.

Example: Index range defined in a map

This section is non-normative.

In this scenario, an index range is defined in the DITA map. Ranges defined in a DITA map can span topics.

Consider the following DITA map:

<map>
  <title>Food available in the Acme cafeteria</title>
  <!-- ... -->
  <topicref href="apples.dita">
    <topicmeta>
      <keywords>
        <indexterm start="acme-fruit">fruit</indexterm>
      </keywords>
    </topicmeta>
  </topicref>
  <topicref href="oranges.dita"/>
  <topicref href="pineapples.dita">
    <topicmeta>
      <keywords>
        <indexterm end="acme-fruit"/>
      </keywords>
    </topicmeta>
  </topicref>
  <!-- ... -->
</map>

The index range begins with the start of the first topic title in apples.dita, and it continues until the end of the last element in pineapples.dita.