Field descriptions of a ContentHub document

This page describes the fields of a ContentHub document.

ch:haufe-document

Root element of every ContentHub document is ch:haufe-document.

Prerequisites

Supported namespaces

Namespace prefix   URL
ch                 http://contenthub.haufe-lexware.com/haufe-document
chb                http://contenthub.haufe-lexware.com/baseline-format-schema
am                 http://idesk.haufe-lexware.com/document-meta
xs                 http://www.w3.org/2001/XMLSchema
xhtml              http://www.w3.org/1999/xhtml

Note 1:

From the XML perspective, neither the specific namespace prefixes chosen nor the element on which the namespaces are declared (nor whether a namespace is used as the default namespace without a prefix) matters. By convention, though, the ContentHub document namespace is used with the prefix ch and the ContentHub baseline format namespace with the prefix chb. These namespace prefixes are typically declared on the root element.

Most of the time you will see additional namespace declarations for use in custom metadata. (In this example, the prefix am – presumably short for "Aurora metadata" – is assigned by the production line for use in iDesk-specific metadata.)
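
Because only the namespace URI is significant, a consumer should look up elements by URI rather than by prefix. The following is a minimal Python sketch using the standard library's ElementTree; the helper name find_application and the two sample documents are illustrative assumptions, not part of any ContentHub API.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the ContentHub document format (see the table above).
CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def find_application(xml_text):
    """Return the text of the first ch:application element, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses elements as "{namespace-uri}local-name",
    # so the prefix used in the serialized document is irrelevant.
    node = root.find(f".//{{{CH_NS}}}application")
    return None if node is None else node.text


# The same (hypothetical) document, once with the conventional prefix
# and once with the ContentHub namespace as default namespace.
with_prefix = (
    f'<ch:haufe-document xmlns:ch="{CH_NS}">'
    "<ch:meta><ch:application>idesk</ch:application></ch:meta>"
    "</ch:haufe-document>"
)
with_default = (
    f'<haufe-document xmlns="{CH_NS}">'
    "<meta><application>idesk</application></meta>"
    "</haufe-document>"
)
```

Both sample documents yield the same result because the lookup never mentions a prefix.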

Note 2:

With very few exceptions, both the ContentHub document schema and the baseline schema allow arbitrary attributes to be added to all elements.

Note 3:

The XML Schema type xs:dateTime accepts values with and without timezone offset. We recommend that content producing systems always write values with an offset and that content consuming systems assume UTC if the offset is absent. Then the parsed value might be one or two hours off, but at least the error is always consistent, no matter whether the document is parsed in summer (daylight saving time) or winter.

XML timestamps of type xs:dateTime can store values with almost arbitrary precision. Content producing systems SHOULD round these values to a reasonable temporal unit (say, to seconds).

Special care is needed if a content producing application keeps records of the date of a timestamp only and not of the time of day. If such a timestamp is serialized as an xs:dateTime with the time part set to midnight (00:00:00) of the day in question, then - depending on the time zone the value is parsed in - the publication date may fall on the previous day. For all expected Haufe use cases, the pragmatic workaround is to set the time part of the timestamp to noon (CET or UTC).
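
The recommendations of this note can be sketched in Python as follows. This is a simplified illustration, not a complete xs:dateTime parser: it relies on datetime.fromisoformat and therefore covers only the common lexical forms. The function names are hypothetical, and the choice of noon UTC (rather than noon CET) is one of the two options mentioned above.

```python
from datetime import date, datetime, time, timezone


def parse_xs_datetime(value):
    """Parse an xs:dateTime string; a missing offset is read as UTC."""
    text = value.strip()
    # Older Python versions do not accept the "Z" suffix in fromisoformat.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Consumer-side convention: assume UTC if the offset is absent.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def date_only_to_xs_datetime(day):
    """Serialize a date-only value with the time part set to noon UTC."""
    noon = datetime.combine(day, time(12, 0, 0), tzinfo=timezone.utc)
    return noon.isoformat()
```

Writing noon instead of midnight keeps the calendar date stable for any parser whose time zone is within twelve hours of UTC.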

ch:meta

Every ContentHub document starts with a ch:meta element that holds the document's metadata.

The schema defines ch:meta as a sequence of elements, some required and some optional. All elements in the ContentHub namespace come first (often referred to as the "well-known metadata"); then custom metadata elements from other namespaces may follow.

Note: The order of the elements in the ContentHub namespace is relevant! The "Unique Particle Attribution" constraint of XML Schema 1.0 (https://www.w3.org/TR/xmlschema-1/#non-ambig) makes it impossible to define ch:meta in a way that ignores the element order.

ch:publicationDate

ch:publicationDate (type xs:dateTime) may appear at most once and describes the point in time when the first revision of this document was published.

<ch:publicationDate>2014-05-21T17:34:00+02:00</ch:publicationDate>

ch:revisionDate

ch:revisionDate (type xs:dateTime) may appear at most once and describes the point in time when this revision of the document was published. If both ch:revisionDate and ch:publicationDate are present, then ch:revisionDate MUST NOT fall before ch:publicationDate.

<ch:revisionDate>2015-01-01T12:00:00+00:00</ch:revisionDate>

ch:chronologicalSortDate

From an XML schema perspective, ch:chronologicalSortDate (type xs:dateTime) may appear at most once. However, the ContentHub is supposed to add this element, where absent, to all documents; therefore, content consuming applications can expect this element to be present exactly once.

ch:chronologicalSortDate specifies the point in time used when a content consuming application needs to sort documents in chronological order. If ch:publicationDate or ch:revisionDate is present, then ch:chronologicalSortDate SHOULD be equal to one of these values.

If ch:chronologicalSortDate is absent on ingest into the ContentHub, then the ContentHub assigns the first available value from the following list:

The producing application MAY choose to specify a sort date in the future. Then the producing application SHOULD also set metadata that excludes this document from a typical end-user search result set (e.g., set ch:visible to false), so consuming applications do not present documents to end-users that are not formally published yet.

<ch:chronologicalSortDate>2015-01-01T12:00:00Z</ch:chronologicalSortDate>

ch:visible

When searching in ContentHub, ContentHub will by default exclude documents with ch:visible set to false from the search result. This behavior can be overruled by consuming applications by appending one of the following expressions to their search expression:

However, consuming applications SHOULD NOT present hidden documents to a user. There are legitimate reasons why the producer decided to keep these documents hidden.

The ch:visible element can appear at most once. If absent, ContentHub will add this element with value true to the metadata.

<ch:visible>true</ch:visible>

ch:ingestOnly

ch:ingestOnly is relevant for content producing applications only. If its value is true, then ContentHub will not make this document available through any content consumer API. This element exists only for the benefit of some content producers that, for technical reasons, need to store extra data with their content collection in ContentHub.

The ch:ingestOnly element can appear at most once. If absent, then its value is assumed to be false.

From a content consumer's perspective, this element either has the value false or is absent.

<ch:ingestOnly>false</ch:ingestOnly>
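
A consumer that reads the metadata directly can apply the documented defaults for ch:visible (true) and ch:ingestOnly (false) as in the following Python sketch. It assumes that both elements use the xs:boolean lexical forms ("true", "false", "1", "0"), which this page does not state explicitly; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def read_flag(meta, local_name, default):
    """Read a boolean child of ch:meta, falling back to the given default."""
    node = meta.find(f"{{{CH_NS}}}{local_name}")
    if node is None or node.text is None:
        return default
    # Assumption: values follow the xs:boolean lexical forms.
    return node.text.strip() in ("true", "1")


def is_presentable(meta_xml):
    """True if the document may be shown to end users."""
    meta = ET.fromstring(meta_xml)
    visible = read_flag(meta, "visible", default=True)
    ingest_only = read_flag(meta, "ingestOnly", default=False)
    return visible and not ingest_only
```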

ch:blob or ch:img

ch:blob and ch:img both have some required attributes but no content. Each of these elements describes a binary object attached to this document.

There may be zero or more ch:blob and ch:img elements.

Required attributes:

Optional well-known attributes:

ch:img is in the XML schema "substitution group" of ch:blob and can therefore appear wherever the schema expects a ch:blob. In particular, ch:blob and ch:img elements may appear in any order.

<ch:blob name="calculatorArchive"
         type="application/zip"
         locator="./attachments/calculator.zip"/>
<ch:img name="illustration"
        type="image/png"
        locator="/images/img1234.png"
        width="600"
        description="some stock photo"/>
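
Since ch:blob and ch:img may be interleaved, a consumer typically collects both in document order. The Python sketch below does this with ElementTree. The attribute names name, type, and locator are taken from the example above; treating them as the relevant attributes is an assumption, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"
ATTACHMENT_TAGS = {f"{{{CH_NS}}}blob", f"{{{CH_NS}}}img"}


def list_attachments(meta_xml):
    """Return (kind, name, type, locator) tuples in document order."""
    meta = ET.fromstring(meta_xml)
    attachments = []
    for child in meta:
        if child.tag in ATTACHMENT_TAGS:
            # Strip the "{namespace-uri}" part to get "blob" or "img".
            kind = child.tag.rsplit("}", 1)[1]
            attachments.append(
                (kind, child.get("name"), child.get("type"), child.get("locator"))
            )
    return attachments
```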

ch:contentLanguage

ch:contentLanguage (type xs:token) specifies the language of the content document using the tags defined by RFC 5646.

The element ch:contentLanguage may occur at most once; if absent, its value is assumed to be "de" (for German).

<ch:contentLanguage>de</ch:contentLanguage>

ch:navigationPath

Some content producing applications place their documents into a navigation hierarchy. These producers SHOULD document the navigation path of each document in the element ch:navigationPath (type xs:token). They SHOULD use an encoded URI path syntax.

The element ch:navigationPath may occur zero or more times.

<ch:navigationPath>Auslandsreisen/Österreich/Krankenversicherung</ch:navigationPath>
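
The following Python sketch shows one way to build and read such a path. It assumes that "encoded URI path syntax" means percent-encoding each segment individually; note that the example above shows the non-ASCII character unencoded, so consumers may encounter both forms. The function names are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_navigation_path(segments):
    """Join navigation levels into a path, percent-encoding each segment."""
    # safe="" also encodes "/" inside a segment, so it cannot be
    # mistaken for a level separator.
    return "/".join(quote(segment, safe="") for segment in segments)


def navigation_segments(path):
    """Split a path into decoded levels; raw and encoded input both work."""
    return [unquote(segment) for segment in path.split("/")]
```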

ch:tag

Each ch:tag element (type xs:token) describes a single keyword (German: "Stichwort") associated with the document. The ch:tag element can appear zero or more times.

So far, there is no separate well-known metadata element for catchwords (German: "Schlagwort"). Therefore, some content producers may choose to put them in the ch:tag element as well, others may prefer to keep them separate in a custom metadata element.

ContentHub uses the ch:tag elements to calculate contextually related documents.

<ch:tag>some keyword</ch:tag>
<ch:tag>another keyword</ch:tag>

ch:quickSearchPhrase

Each ch:quickSearchPhrase element (type xs:token) specifies a search phrase this document is supposed to be a perfect match for (according to the content producer). The ch:quickSearchPhrase element may occur zero or more times.

For instance, for each iDesk document the production line copies the value of ch:appDocId (the document's "HI") into a ch:quickSearchPhrase element, because some iDesk users search for specific documents by their HI. In addition, if the document is part of a norm, then the name of the norm and the specific section, as they are typically referenced, are present in possibly multiple variations, each in its own ch:quickSearchPhrase element.

<ch:quickSearchPhrase>10 210-5</ch:quickSearchPhrase>
<ch:quickSearchPhrase>10 PassG</ch:quickSearchPhrase>
<ch:quickSearchPhrase>10 Passgesetz</ch:quickSearchPhrase>
<ch:quickSearchPhrase>§ 10 PassG</ch:quickSearchPhrase>
<ch:quickSearchPhrase>HI1012816</ch:quickSearchPhrase>

ch:relatedDocument

The ch:relatedDocument element describes a provided related document. Contextually related documents, by contrast, are derived from the tags. So the two ways of finding related documents are:

Inside the tag, the following attributes can be found:

ch:publisher

The ch:publisher element (type xs:token) describes "an entity responsible for making the resource available" (as defined by DCMI Metadata Terms, "Publisher"). In general, the value of this element describes a person or organization. In the ContentHub, the publisher of many documents will be Haufe-Lexware GmbH & Co. KG.

The publisher must not be confused with either the creator of the document or the content producing application. The former authored the document, but may not have published it. The latter is merely a technical application that may or may not be operated by the publisher.

The ch:publisher element may occur zero or more times.

<ch:publisher>Haufe-Lexware GmbH &amp; Co. KG</ch:publisher>

ch:creator

The ch:creator element (type xs:token) describes "an entity primarily responsible for making the resource" (as defined by DCMI Metadata Terms, "Creator"). The value might describe an organization or institution, but most often the creator is a natural person (i.e., the document's author or possibly editor). The id is the author-id from Haufe's Autorendatenbank.

The ch:creator element may occur zero or more times.

<ch:creator id="7028">Prof. Dr. Max Muster</ch:creator>

ch:title

The ch:title element (type xs:token) describes "a name given to the resource" (as defined by DCMI Metadata Terms, "Title"). If the document is visible, then its producing application SHOULD always provide a title with (implicit or explicit) attribute name="default" suitable for display on a search result list.

If a producing application associates multiple titles with a document (e.g., short title, long title, etc.), then it can provide them as well using custom values for the name attribute.

The schema definition of the ch:title imposes a uniqueness constraint on the name attributes of all ch:title elements within a document. The name attribute of ch:title has the default value "default".

The ch:title element may occur zero or more times.

<ch:title>§ 10 Untersagung der Ausreise</ch:title>
<ch:title name="shortTitle">Untersagung der Ausreise</ch:title>
<ch:title name="compoundTitle">Paßgesetz / § 10 Untersagung der Ausreise</ch:title>
<ch:title name="shortCompoundTitle">Paßgesetz / § 10 Untersagung der Ausreise</ch:title>
<ch:title name="vaInfoTitle">Behoerde, Datum, Aktenzeichen</ch:title>
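
Because the name attributes are unique within a document, the titles map naturally onto a dictionary. The Python sketch below applies the default value "default" by hand, since ElementTree does not read the schema and therefore does not supply attribute defaults. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def titles_by_name(meta_xml):
    """Map each ch:title name attribute to the title text."""
    meta = ET.fromstring(meta_xml)
    titles = {}
    for node in meta.findall(f"{{{CH_NS}}}title"):
        # A missing name attribute means name="default".
        name = node.get("name", "default")
        titles[name] = (node.text or "").strip()
    return titles
```

A search result list would then display titles_by_name(...).get("default").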

ch:rightsHolder

The ch:rightsHolder element (type xs:token) describes "a person or organization owning or managing rights over the resource" (as defined by DCMI Metadata Terms, "Rights Holder"). A content producing application SHOULD always provide this value, in particular if the rights holder is not part of the Haufe Group.

The ch:rightsHolder element may occur at most once.

<ch:rightsHolder>Haufe</ch:rightsHolder>

ch:packageId

Some applications bundle content into individual packages that end-users can obtain or license separately. For instance, iDesk content is offered as part of many different Office-Line products identified by their respective "PI".

Each element ch:packageId (type xs:NCName) specifies one such package this document is part of. The element may occur zero or more times.

<ch:packageId>PI10413</ch:packageId>
<ch:packageId>PI11525</ch:packageId>

ch:canonicalUrl

The element ch:canonicalUrl addresses a vexing topic. As a content consumer, you need to care if and only if web search engines like Google index your site.

For web search engines, duplicate URLs to essentially the same content are bothersome: Not only do they cause unnecessary bloat to their (already daunting) indices, they also have a negative impact on the search experience of their users. On the other hand, there are often good reasons why content needs to be available under more than one URL.

Therefore, web search engines require that all sites declare for each page with (single- or cross-domain) duplicate content the preferred "primary" (i.e., "canonical") URL - often by a canonical tag in the HTML header, but possibly also through site maps or other means. If Google and the like detect non-declared duplicate content, they penalize the respective site's SEO ranking. In other words, they make the site much less visible on the web, with all the economic consequences this entails.

In the element ch:canonicalUrl (type xs:anyURI), the content producer specifies the canonical URL the content consumers MUST declare for any page that shows this document if their site is crawled by search engine bots.

Obviously, the element ch:canonicalUrl can occur at most once.

<ch:canonicalUrl>
  https://www.haufe.de/personal/haufe-personal-office-platin/passgesetz-10-untersagung-der-ausreise_idesk_PI42323_HI1005967.html
</ch:canonicalUrl>
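
A consumer that renders HTML pages can turn the value into a canonical tag as in the following Python sketch. It strips the surrounding whitespace, which may be present as in the example above, and escapes the URL for use in an attribute. The function name is hypothetical.

```python
from html import escape


def canonical_link(canonical_url):
    """Build the HTML canonical tag for the value of ch:canonicalUrl."""
    # The element content may be wrapped in whitespace and line breaks.
    href = canonical_url.strip()
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'
```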

ch:appDocId

The element ch:appDocId (type xs:anyURI), short for "application-specific document identifier", contains the identifier assigned to this document by the content producer.

The value of ch:appDocId is always a properly percent-encoded relative URI path; for syntactic reasons, it must start with "./" if the relative path's first segment contains a colon. (For instance, iDesk always uses a single-segment path of the form "HI1234", "LI2345.6" or similar. haufe.de uses identifiers of the form "content/345678". Haufe Online Training assigns UUIDs.)

The value of ch:appDocId is necessarily unique among all documents from the same content producer (i.e., all documents that have the same ch:application). There is no such guarantee across content producers.

Unfortunately, the ContentHub metadata schema gives no guarantee how many ch:appDocId elements may occur within a document.

As it stands, content consumers can always expect exactly one ch:appDocId element.

<ch:appDocId>HI1005967</ch:appDocId>

ch:application

The element ch:application identifies the producer of this document. The referenced "content producing application" uploaded this document into the ContentHub and it is the only application that may update or delete this document.

The value type of ch:application is a restriction of xs:NCName that allows ASCII letters, digits, and the underscore '_' only.

The element ch:application must always be present exactly once.

<ch:application>idesk</ch:application>
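
The value restriction can be checked with a regular expression, as in the Python sketch below. The exact pattern used in the schema is not given on this page; the expression shown is an assumption derived from the description above and from the fact that an xs:NCName cannot start with a digit.

```python
import re

# ASCII letters, digits, and underscore only; no leading digit (NCName rule).
APPLICATION_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_application(value):
    """Check a candidate ch:application value against the assumed pattern."""
    return APPLICATION_NAME.fullmatch(value) is not None
```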

ch:contentHubId

The element ch:contentHubId (type xs:anyURI) combines ch:appDocId and ch:application into a single URI and is therefore unique within all ContentHub content.

The value of ch:contentHubId is always a percent-escaped absolute URI constructed by resolving the relative URI path in ch:appDocId against the base URI "contenthub://{ch:application}/".
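
The construction can be sketched in Python as follows. This is a simplification, not a full RFC 3986 reference resolver: it assumes that ch:appDocId is already percent-encoded and contains no "." or ".." segments apart from an optional leading "./". Both function names are hypothetical.

```python
def as_relative_reference(encoded_path):
    """Prefix "./" if the first path segment contains a colon.

    Without the prefix, a value such as "a:b" would be read as a URI
    with the scheme "a" instead of a relative path.
    """
    first_segment = encoded_path.split("/", 1)[0]
    return "./" + encoded_path if ":" in first_segment else encoded_path


def content_hub_id(application, app_doc_id):
    """Resolve ch:appDocId against contenthub://{application}/."""
    # A leading "./" only protects the colon and vanishes on resolution.
    path = app_doc_id[2:] if app_doc_id.startswith("./") else app_doc_id
    return f"contenthub://{application}/{path}"
```

For the iDesk example on this page, content_hub_id("idesk", "HI1005967") yields contenthub://idesk/HI1005967.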

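With respect to the number of occurrences of ch:contentHubId, the same holds as for ch:appDocId: the schema gives no guarantee, but content consumers can always expect exactly one ch:contentHubId element.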

ch:fingerprint

ch:fingerprint is a hash value calculated based on the fields of the document. A changed fingerprint means there was at least one change in the document.

<ch:fingerprint>f333825ef196f52592cf001bfd38f886</ch:fingerprint>

ch:documentType

In most cases, the content loaded into the ContentHub by a single content producer is not uniform. It can be classified into different types; the element ch:documentType (type xs:token) specifies the type a document belongs to.

Content producers are supposed to document the values of ch:documentType they assign, the semantics of these types, and which (possibly custom) metadata values a consumer can expect for documents of each type, respectively.

The element ch:documentType must appear exactly once.

<ch:documentType>NORM</ch:documentType>

Custom metadata fields

From here on, content producers are technically free to add "any well-formed XML".

In practice, producers are supposed to add only metadata elements that belong to a producer-specific namespace and to document all custom metadata elements. Furthermore, it is advisable that producers stick to simple value types. ("Simple" in the sense of XML schema – i.e., no nested XML elements.)

am:documentType

The following shows some custom metadata elements as generated by the application "idesk".

<am:documentType>BEITRAG</am:documentType>

am:language

The language of the iDesk document. am:language is most likely redundant; ch:contentLanguage should be used instead.

<am:language>de</am:language>

am:ressortId

Contains the value of the intra-idx attribute of the ressort gldg. The element is present only if the document itself is a ressort gldg; in particular, it is absent on the subdocuments contained in it.

<am:ressortId>HI1012816</am:ressortId>

am:documentClassification

iDesk classifies its documents in a rather elaborate way. It distinguishes the document's am:documentType (not identical to ch:documentType), am:documentCategory, and, in some cases, am:documentSubcategory.

The element am:documentClassification joins the non-empty values of these elements in the mentioned order, separated by an underscore.

<am:documentClassification>NORM_GESETZ</am:documentClassification>
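
The joining rule is simple enough to state as a one-line Python function. The function and parameter names are hypothetical; they stand for the values of am:documentType, am:documentCategory, and am:documentSubcategory.

```python
def document_classification(document_type, category=None, subcategory=None):
    """Join the non-empty classification values with underscores."""
    parts = [document_type, category, subcategory]
    # Absent (None) and empty values are skipped.
    return "_".join(part for part in parts if part)
```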

am:documentType

The element am:documentType describes the coarsest classification of this iDesk document. Typical values are NORM, RECHTSQUELLE, ENTSCH, KOMMENTAR, BEITRAG, ARBEITSHILFE, and MUSTERDOKUMENT.

<am:documentType>NORM</am:documentType>

am:documentCategory

All iDesk documents are also assigned an am:documentCategory.

In the case of court decisions and proceedings, for instance, am:documentCategory describes the type of jurisdiction (constitutional, labor, administrative, etc.).

<am:documentCategory>GESETZ</am:documentCategory>

am:documentSubcategory

The element am:documentSubcategory (only present on court decisions and proceedings) describes the respective court instance.

<am:documentSubcategory>SONST</am:documentSubcategory>

am:rootId

For some document types, iDesk documents are grouped into "logical" documents; the individual ContentHub documents are in fact sections within such a logical document. The element am:rootId contains the ch:appDocId of the ContentHub document that represents the top-most section of this document's logical document. (Of course, it is possible that this ContentHub document is the root document; then am:rootId is equal to the value of ch:appDocId of this document.)

<am:rootId>HI1012816</am:rootId>

am:preDocId

The element am:preDocId specifies the application-specific id of the ContentHub document that represents the immediately preceding section within the logical document.

Obviously, this element is absent if this document is the root document.

<am:preDocId>HI1005965</am:preDocId>

am:sucDocId

The element am:sucDocId specifies the application-specific id of the ContentHub document that represents the immediately succeeding section within the logical document.

This element is absent if this document represents the last section within its logical document (or, accordingly, if this document is not split into multiple sections).

<am:sucDocId>HI2177965</am:sucDocId>

am:outlinePath

In a document outline, the sections form (in general) a hierarchical tree. (E.g., section 1.5.3 might be followed by section 1.6 that is followed by its first "child" section 1.6.1 and so on.)

The element am:outlinePath describes the path from the root document along this tree's child axis to this document. Syntactically, it is a sequence of application-specific document identifiers separated by forward slashes.

Note that the application-specific id of idesk documents never contains slashes.

<am:outlinePath>HI1012816/HI1005950/HI1005967/</am:outlinePath>
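
Because iDesk identifiers never contain slashes, the path can be split safely on that character. The Python sketch below also tolerates the trailing slash shown in the example. It assumes that the last identifier in the path is this document itself, as the example values on this page suggest; the function names are hypothetical.

```python
def outline_ids(outline_path):
    """Return the document ids from the root down to this document."""
    # Filtering empty strings drops the segment after a trailing slash.
    return [segment for segment in outline_path.strip().split("/") if segment]


def parent_id(outline_path):
    """Return the id of the parent section, or None for a root document."""
    ids = outline_ids(outline_path)
    return ids[-2] if len(ids) >= 2 else None
```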

am:rootShortTitle

Following the DCMI (Dublin Core Metadata Initiative) definition, ch:title is supposed to describe this resource (i.e., this document / section). Often, one needs quick access to the title of the logical document as well. Therefore, the element am:rootShortTitle provides the short title of the corresponding root document.

<am:rootShortTitle>Paßgesetz</am:rootShortTitle>

am:isRoot

The element am:isRoot (type xs:boolean) is true if and only if this document is the root of its logical document.

<am:isRoot>false</am:isRoot>

am:subjectArea

The am:subjectArea element is used to assign responsibility within the editorial department and thus also to bill fees (German: "Honorarkosten"). Each element contains the text values that correspond to one subject area id, as obtained from the subject area list maintained in HCS; the values of the three levels are separated by a pipe character.

The element may occur multiple times, once per subject area id.

<am:subjectArea>Arbeitsrecht | Abwesenheiten | Krankheit</am:subjectArea>
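
A consumer that needs the individual levels can split the value on the pipe character, as in this Python sketch. Trimming the whitespace around each level follows the formatting of the example above; the function name is hypothetical.

```python
def subject_area_levels(value):
    """Split an am:subjectArea value into its levels."""
    return [level.strip() for level in value.split("|")]
```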

am:subjectAreaId

The am:subjectAreaId element may occur multiple times; each occurrence contains a single subject area id.

<am:subjectAreaId>702-0</am:subjectAreaId>
<am:subjectAreaId>49-1</am:subjectAreaId>

am:court

am:court contains the name of the court involved in the court decision or proceeding. The element can occur at most once.

<am:court>SG Karlsruhe</am:court>

am:referenceNumber

am:referenceNumber contains the reference number of the court decision or proceeding. The element can occur at most once.

<am:referenceNumber>S 12 AS 2208/22</am:referenceNumber>

am:law

All laws that are part of the decision (German: "Normenkette") are inserted as am:law elements into the metadata section. If the source can be split in a meaningful way, the attributes name and section will be inserted too. The element may occur zero or more times.

<am:law name="EUVO 1387/2013" section="Art. 1 Abs. 1">VO (EU) 1387/2013 Art. 1 Abs. 1</am:law>
<am:law>KN UPos 8504 4090</am:law>

am:infoSource

The element am:infoSource is used to collect all publications (German: "Fundstellen") regarding the court decision or administrative order or the like. The elements have attributes for the publication name (most commonly an abbreviation), the year of publication and the pages. The element may occur zero or more times.

<am:infoSource name="BB" year="1994" page="1430">BB 1994, 1430</am:infoSource>

am:decisionStatus

The element am:decisionStatus contains different values for court decisions and proceedings.

In court decisions the values are "V" for officially published decisions (German: "zur amtlichen Veröffentlichung bestimmt") and "NV" for not officially published decisions (German: "nicht zur amtlichen Veröffentlichung bestimmt"). The element can occur at most once and only for court decisions of the BFH.

For proceedings, the values can be "ANH" for still ongoing proceedings (German: "anhängig") and "ERL" for finished proceedings (German: "erledigt"). In this case, the element must appear exactly once.

<am:decisionStatus>V</am:decisionStatus>

am:publicationDate

The element am:publicationDate holds the publication date of the court decision. The element can occur at most once.

<am:publicationDate>2021-06-25T00:00:00</am:publicationDate>

am:parties

Each am:parties element holds the name of a participant. The element has an optional attribute role with the possible values "plaintiff" (German: "Kläger") and "defendant" (German: "Beklagter").

The element may occur zero or more times.

<am:parties>V</am:parties>

am:previousInstance and am:nextInstance

The elements contain information about the "previous" (German: "Vorinstanz") and "following" (German: "Nachinstanz") court instances. Each element may occur zero or more times.

<am:previousInstance court="ArbG Chemnitz" date="1999-09-09" referencenumber="1 Ca 2077/99">ArbG Chemnitz, Urteil vom 09.09.1999 - 1 Ca 2077/99</am:previousInstance>
<am:nextInstance court="EuGH" date="2018-11-06" referencenumber="C-569/16 und C-570/16">EuGH, Entscheidung vom 06.11.2018 - C-569/16 und C-570/16</am:nextInstance>

chb:preview

The chb:preview element contains mixed content subject to the ContentHub's baseline schema. It primarily serves use cases in relation to iDesk's logical documents.

Most content consuming applications that provide a search interface to their users include some kind of document preview into their search result list. Some go for search hit highlighting that shows document fragments where the search query matched, others prefer to always show the beginning of the document. Either approach assumes the document's body is not empty.

However, the root documents of longer iDesk articles are often empty – the text starts only with the first non-root section. On the other hand, if the user searches for "company car" and the application's corpus contains an encyclopedic article (i.e., an idesk document with am:documentClassification equal to "BEITRAG_LEXIKON") on the topic of company cars, then one would expect the root of this logical document high up in the search result list.

The content of chb:preview is meant to be the source of the content excerpts shown to the users in such cases. Practically speaking, it is an excerpt from the logical document starting at the section represented by this document, possibly supplemented by content from following sections. If this document contains enough content already, then it's supposed to simply be the beginning of this document.

The element chb:preview has a boolean-valued attribute "autogenerated". If false, then the content of chb:preview was provided by the content producer and the ContentHub is merely passing it on. If true, then the ContentHub generated the preview by means of its "teaser" algorithm.

<chb:preview autogenerated="false" xmlns="http://www.w3.org/1999/xhtml">
  <p>Some text</p>
</chb:preview>

ch:baselineSearchableText

The element ch:baselineSearchableText may contain any mixed XML content that is never shown to end users but is still supposed to be searched by every full-text search over this document's baseline content. Therefore, this element is relevant for content consuming applications only if they implement their own search.

Besides covering expected user search phrases not handled by synonym processing etc., ch:baselineSearchableText primarily offers a place to store textual descriptions of binary document attachments (referenced by the metadata elements ch:blob and ch:img). For instance, this might be the textual description of an image or diagram, or it could be text extracted from a PDF.

In practice, most consumers will ignore all markup and use the text content of ch:baselineSearchableText only.

<ch:baselineSearchableText>
    <ITI>§§ 1 - 23 Erster Abschnitt Paßvorschriften</ITI>
</ch:baselineSearchableText>
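
Ignoring the markup and keeping only the text content can be done as in the following Python sketch. Collapsing runs of whitespace is an additional choice made here for indexing convenience, not something this page prescribes; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def searchable_text(element_xml):
    """Return the text content of the element with all markup removed."""
    element = ET.fromstring(element_xml)
    # itertext() yields the text of the element and all its descendants.
    raw = "".join(element.itertext())
    # Collapse indentation and line breaks into single spaces.
    return " ".join(raw.split())
```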

chb:baselineContent

The element chb:baselineContent contains the actual document content in "baseline" format. The baseline format is, for the most part, identical to XHTML with some extra attributes to the xhtml:a and xhtml:img elements that help content consumers to render the proper links and some restrictions in place to avoid conflicts with the content consumers' CSS. In particular, baseline content MUST NOT contain style attributes and the names of all classes referenced in class attributes MUST start with the prefix "chb-".

Content producers are supposed to generate baseline content that consumers can render within their application with minimal transformation only. In practice, most consumers will re-write the XHTML into "modern" HTML as defined by the WHATWG, and modify link targets to point to their application.

Content consumers are not expected to have detailed knowledge about the content structure or the exact semantics of specific baseline content parts. Consumers with use cases that require such knowledge should refer to the ch:nativeContent element.

<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">
<a id="SUB_HI1005967_absatz_1" class="chb-sectionId"> </a>
</chb:baselineContent>
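
The two restrictions named above (no style attributes, class names prefixed with "chb-") can be checked mechanically. The Python sketch below is a hypothetical producer-side check, not the validation the ContentHub itself performs; the function name and the message texts are assumptions.

```python
import xml.etree.ElementTree as ET


def baseline_violations(content_xml):
    """List violations of the style and class-prefix rules."""
    root = ET.fromstring(content_xml)
    problems = []
    # iter() visits the root element and all of its descendants.
    for element in root.iter():
        if element.get("style") is not None:
            problems.append(f"style attribute on <{element.tag}>")
        for class_name in element.get("class", "").split():
            if not class_name.startswith("chb-"):
                problems.append(f"class '{class_name}' lacks the chb- prefix")
    return problems
```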

ch:nativeContent

The element ch:nativeContent is a representation of the same content as found in chb:baselineContent, but in a format specific to the content producer and expected to be semantically much richer.

Where chb:baselineContent is optimized for ease of use, consumers that process ch:nativeContent are expected to be familiar with the details of the format generated by the respective content producer.

The ContentHub cannot validate the content of ch:nativeContent except that it is well-formed XML. Therefore, content consumers that process ch:nativeContent should notify the respective content producers of their dependency so that any changes to the format can be coordinated.

In case of iDesk content, the production line adds the corresponding snippet of the logical document in Intrabox DTD format as shown below.

<ch:nativeContent>
  <gldg intra-idx="HI1005967">...</gldg>
</ch:nativeContent>