ingest

This API lets you feed documents into the ContentHub.


API Settings


Note: This API supports only the Client Credentials flow. You log in with your application's client ID and client secret alone, without authenticating a user. In effect, the /token endpoint of any of the Authentication Method(s) below can be used for that.


Log in using a local username and password.

Token Endpoint

Single Document Ingest

The Single Document Ingest is intended for content producers that publish one document at a time.

Fetch a token

  1. Sign up for the portal
  2. Register your application
    1. Leave the checkbox for OAuth2.0 Flows unchecked
    2. If you are using a server-side application, make sure to select "Confidential: Server side application" from the Client Type dropdown
  3. Subscribe to the ingest API by clicking on the "Subscribe" button at the end of this page
    1. Leave the checkbox "Trust this application" unchecked
    2. Choose the trial or unlimited plan depending on your needs

After that, the admin at Haufe-Lexware will choose a scope for your use case and inform you about it. You can also look up the assigned scopes at any time: open an application under the "Applications" tab, then press the "Select" button to choose the subscription whose scopes you want to view.

Now you are ready to use the API. There are two important URLs:

  1. The token endpoint URL
  2. The API URL

You can find both at the top of this page. The token endpoint URL is shown when you unfold the section with the greenish background (entitled "Username and Password (local)").

Use this command to request a token:

curl --location --request POST '<token-endpoint-url>'  \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'client_id=...' \
--data-urlencode 'client_secret=...' \
--data-urlencode 'scope=...'

The response to this request consists of a small JSON object that contains the token in the access_token property. All the ingest requests have to contain the token as part of the Authorization header like this:

curl --location --request POST '<api-url>/ingest/v1/workspaces/<workspace_name>/content' \
--header 'Authorization: Bearer <access-token>'
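The same two steps can be sketched in Python. This is a minimal illustration using only the standard library, not an official client: the function names are made up for this example, the requests are only built (not sent), and the response is assumed to be a JSON object with an access_token property as described above.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, scope):
    """Build (but do not send) the client-credentials token request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def bearer_header(token_response_body):
    """Read access_token from the JSON response and format the header."""
    token = json.loads(token_response_body)["access_token"]
    return {"Authorization": f"Bearer {token}"}
```

To actually fetch a token, pass the request to urllib.request.urlopen and hand the response body to bearer_header.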

Workspaces

Inside ContentHub, documents are grouped into workspaces. It is natural to use one workspace per content-producing application, so before a new content-producing application can start ingesting documents, it has to create its workspace. This step is done only once, when the application starts ingesting documents for the first time (or again if the workspace was deleted and has to be recreated). The name of the workspace has to be in line with the scope that was provided by the ContentHub administrator.

  • The scope will always be application_<workspace_name>. E.g. application_demoWorkspace, where demoWorkspace will be the name of the workspace that has to be created.
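Because the workspace name follows directly from the scope, it can be derived mechanically. A small sketch (the helper name is hypothetical) that applies the application_<workspace_name> rule stated above:

```python
SCOPE_PREFIX = "application_"


def workspace_name_from_scope(scope):
    """Return the workspace name encoded in a scope like application_demoWorkspace."""
    if not scope.startswith(SCOPE_PREFIX) or len(scope) == len(SCOPE_PREFIX):
        raise ValueError(f"not an application scope: {scope!r}")
    return scope[len(SCOPE_PREFIX):]
```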

Create a workspace

curl --location --request POST '<api-url>/ingest/v1/workspaces?applicationId=<workspace_name>' \
--header 'Authorization: Bearer <access-token>'

Delete a workspace

curl --location --request DELETE '<api-url>/ingest/v1/workspaces/<workspace_name>' \
--header 'Authorization: Bearer <access-token>'

List workspaces

curl --location --request GET '<api-url>/ingest/v1/workspaces' \
--header 'Authorization: Bearer <access-token>'
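The three workspace calls differ only in HTTP method and URL. The following sketch mirrors the curl commands above with Python's standard library; the function name and the action keywords are assumptions made for this example, and the requests are built without being sent.

```python
import urllib.parse
import urllib.request


def workspace_request(api_url, access_token, action, workspace_name=None):
    """Build the request for 'create', 'delete' or 'list' (illustrative names)."""
    base = api_url.rstrip("/") + "/ingest/v1/workspaces"
    if action in ("create", "delete") and not workspace_name:
        raise ValueError("workspace_name is required for " + action)
    if action == "create":
        method = "POST"
        url = base + "?" + urllib.parse.urlencode({"applicationId": workspace_name})
    elif action == "delete":
        method = "DELETE"
        url = base + "/" + urllib.parse.quote(workspace_name, safe="")
    elif action == "list":
        method = "GET"
        url = base
    else:
        raise ValueError("unknown action: " + action)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {access_token}"}
    )
```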

Document ingestion

An ingest document represents an individually addressable unit of content together with the associated metadata. The ingest requires an XML document that encodes the metadata and transports the content.

Ingest document schema

Every ingest XML document follows the following template:

<haufe-document xmlns="http://contenthub.haufe-lexware.com/haufe-document">
    <!-- The meta element holds the document's metadata, cf. the ingest metadata description -->
    <meta>
        <!-- ... -->
    </meta>
    <!--
      The following elements describe the document's textual content.
      (Actually, the textual content appears twice - once in baseline and once in native format.)
    -->
    <preview>
        <!-- ... -->
    </preview>
    <baselineSearchableText>
        <!-- ... -->
    </baselineSearchableText>
    <baselineContent>
        <!-- ... -->
    </baselineContent>
    <nativeSearchableText>
        <!-- ... -->
    </nativeSearchableText>
    <nativeContent>
        <!-- ... -->
    </nativeContent>
    <!-- possible future extension elements, ignored for now -->
</haufe-document>

Metadata

Every document ingested into the content hub needs to have metadata. There are (very few) mandatory metadata fields and a number of optional metadata fields whose semantics are defined by the content hub. In addition, content producers are free to add custom metadata fields as long as they place the respective elements into a custom namespace. This means that a content producer is supposed to use the pre-defined optional fields whenever a document has metadata that matches the respective field's semantics, so that consumers can refer to a uniform set of base metadata. If, on the other hand, the semantics of metadata available to the producer do not match the semantics of any optional field, the producer is supposed to place the metadata into a custom field.

Metadata schema

Textual content

The elements baselineContent, baselineSearchableText, nativeContent, and nativeSearchableText represent the textual content of the document: baselineContent and nativeContent carry it with markup, baselineSearchableText and nativeSearchableText as plain text.

Baseline format

The baseline format serves consumers that are not interested in an elaborate, domain-driven (and thus producer-specific) markup, but that need to display content from potentially many producers to the end user with as little hassle as possible. The format therefore has to be close to HTML. Since content in baseline format is used inside content hub documents, which are XML documents, it should be an XML format as well. Together, that makes a format derived from XHTML the best fit.

Baseline schemas

Preview format

The preview field should contain a teaser from the baseline content. The preview element therefore shares the XML schema with the baseline format, including the XHTML modules as well as the attributes and elements. The preview can be provided by the content producer. If it is not provided, the ingest service generates one by taking a teaser of 500 characters from the baseline content.
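A producer that wants to supply its own preview could derive a plain-text teaser along these lines. This is only a sketch: how the ingest service itself cuts the 500 characters (for example, whether it preserves markup or word boundaries) is not specified here, so the whitespace normalization and the hard cut below are assumptions, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def teaser_text(baseline_content_xml, limit=500):
    """Return at most `limit` characters of the text inside the baseline content."""
    root = ET.fromstring(baseline_content_xml)
    # Join all text nodes and collapse runs of whitespace (assumed behavior).
    text = " ".join("".join(root.itertext()).split())
    return text[:limit]
```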

Native format

There will also be consumers that "understand" and need the elaborate, domain-driven content markup of specific applications; for these consumers, the baseline format is insufficient.

The content hub therefore expects content producers to also submit their content in an application (or even document-type) specific format - the so-called native format - that transports as much of the domain-specific markup as the application is willing to share.

The content hub makes no attempt to interpret or analyze the content in native format. It merely stores and forwards the native content. The only assumption is that the native content - if present - is represented as a single well-formed XML document. (In particular, the native content MUST have a single root element, so the nativeContent element is either empty or includes a single child element.)
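The single-root rule can be checked on the producer side before ingesting. A minimal sketch, assuming the document uses the haufe-document namespace shown in the template; the function name is made up, and treating a missing nativeContent element as acceptable is an assumption of this example.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def native_content_has_single_root(haufe_document_xml):
    """True if nativeContent is empty or holds exactly one child element."""
    root = ET.fromstring(haufe_document_xml)
    native = root.find(f"{{{CH_NS}}}nativeContent")
    if native is None:
        return True  # assumption: no nativeContent element means no native content
    children = list(native)
    # Stray text next to the child would mean the content is not one XML document.
    stray_text = (native.text or "").strip() or any(
        (child.tail or "").strip() for child in children
    )
    return len(children) <= 1 and not stray_text
```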

baselineSearchableText and nativeSearchableText

The (optional) elements baselineSearchableText and nativeSearchableText are used to submit text fragments that are to be indexed in addition to the text content of the baselineContent and nativeContent elements, respectively. The searchable text elements may hold any mixed content.

Ingest document example

curl --location --request POST '<api-url>/ingest/v1/workspaces/demoWorkspace/content' \
--header 'Authorization: Bearer <access-token>' \
--header 'Content-Type: application/atom+xml' \
--data @contentHubDocument.xml

contentHubDocument.xml

<entry xmlns="http://www.w3.org/2005/Atom">
   <content type="application/vnd.haufegroup.chsingledocingest+xml">
      <ch:haufe-document xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document">
          <ch:meta>
              <ch:chronologicalSortDate>2022-09-27T00:15:00+02:00</ch:chronologicalSortDate>
              <ch:visible>true</ch:visible>
              <ch:contentLanguage>de</ch:contentLanguage>
              <ch:tag>Steuern</ch:tag>
              <ch:tag>Gesetzgebung</ch:tag>
              <ch:title name="default">Kabinett beschließt Jahressteuergesetz 2022</ch:title>
              <ch:canonicalUrl>https://www.haufe.de/personal/entgelt/jahressteuergesetz-2022_78_575614.html</ch:canonicalUrl>
              <ch:appDocId>57561444</ch:appDocId>
              <ch:application>demoWorkspace</ch:application>
              <ch:contentHubId>contenthub://demoWorkspace/57561444</ch:contentHubId>
              <ch:documentType>News</ch:documentType>
              <pr:rightsHolder xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata"/>
              <pr:isNotVisibleInRecommendationBox xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">false</pr:isNotVisibleInRecommendationBox>
              <pr:feederstate xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">SUCCESS</pr:feederstate>
              <pr:overline xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">Jahressteuergesetz</pr:overline>
              <pr:category xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">Personal</pr:category>
              <pr:subcategory xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">Entgelt</pr:subcategory>
              <pr:feedertime xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata" xmlns:xs="http://www.w3.org/2001/XMLSchema" xs:type="dateTime">2022-09-28T09:27:09.579112+02:00</pr:feedertime>
          </ch:meta>
          <chb:baselineContent xmlns="http://www.w3.org/1999/xhtml" xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
              <div>
                  <div class="chb-text">
                      <p>Das Jahressteuergesetz 2022 soll Entlastungen bringen und wird voraussichtlich 
                          <strong>Ende Oktober in den Bundestag</strong> kommen. Aus lohnsteuerlicher Sicht ist vor allem auf folgende Punkte hinzuweisen:
                      </p>
                      <h2>Rentenbeiträge voll absetzbar</h2>
                      <p>Der vollständige Abzug von 
                          <strong>Altersvorsorgeaufwendungen </strong>als Sonderausgaben soll bereits ab dem Jahr 2023 (statt erstmals im Jahr 2025) möglich sein. Das hat auch Auswirkungen auf die Berücksichtigung im Rahmen der sogenannten Vorsorgepauschale im Lohnsteuerabzugsverfahren.
                      </p>
                  </div>
              </div>
          </chb:baselineContent>
          <ch:nativeContent>
              <noNativeContent/>
          </ch:nativeContent>
      </ch:haufe-document>
   </content>
</entry>
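For producers that generate the haufe-document programmatically, the Atom wrapper and the request can be assembled as sketched below. The function names are hypothetical and the request is only built, not sent; the content type values and the URL layout are taken from the example above.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"
INGEST_CONTENT_TYPE = "application/vnd.haufegroup.chsingledocingest+xml"

# Prefixes only affect how the serialized XML looks, not its meaning.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("ch", CH_NS)


def wrap_in_atom_entry(haufe_document_xml):
    """Wrap a haufe-document in an Atom entry and return UTF-8 bytes."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    content = ET.SubElement(
        entry, f"{{{ATOM_NS}}}content", {"type": INGEST_CONTENT_TYPE}
    )
    content.append(ET.fromstring(haufe_document_xml))
    return ET.tostring(entry, encoding="utf-8")


def build_ingest_request(api_url, workspace_name, access_token, atom_entry):
    """Build (but do not send) the single document ingest request."""
    url = f"{api_url.rstrip('/')}/ingest/v1/workspaces/{workspace_name}/content"
    return urllib.request.Request(
        url,
        data=atom_entry,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/atom+xml",
        },
    )
```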

Ingesting Documents with Binary attachments (Multipart documents)

Documents regularly contain images (e.g. jpg or png) and other binaries like MS Word or pdf documents. Moreover they are often packaged together as a composite document.

Composite documents will be uploaded as multipart documents with content type multipart/* (e.g. multipart/form-data or multipart/mixed). The first part will consist of a valid haufedoc Atom entry (the main content) with content type application/atom+xml. The main content contains blob and/or img elements in its meta section. Each additional part consists of binary content and corresponds to one of the main content's blob or img elements.

Main content and all attachments are written to the database. Attachments are stored in a way that allows their relationship to the parent main content document to be discovered. As a simple rule, every attachment must have a corresponding blob or img element, and conversely every blob or img element must have a corresponding attachment (no dangling parts or elements). Initially, composite documents can be loaded via the single doc ingest API.

Here is an example of some img and blob elements (from the meta section):

<ch:haufe-document
    xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document"
    xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
<ch:meta>
<ch:blob locator="blobtest.docx" name="blobtest_docx" type="application/docx"></ch:blob>
<ch:img width="101" height="110" type="image/png" locator="imgtest.png" name="imgtest_png"></ch:img>
</ch:meta>
<chb:preview xmlns="http://www.w3.org/1999/xhtml" autogenerated="false">
</chb:preview>
<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">blobtest</chb:baselineContent>
<ch:nativeContent/>
</ch:haufe-document>

Ingest multipart document example

curl --location --request POST '<api-url>/ingest/v1/workspaces/demoWorkspace/content' \
--header 'Authorization: Bearer <access-token>' \
--header 'Content-Type: multipart/form-data' \
--form 'atom=@multipartxml.xml;type=application/atom+xml' \
--form 'imgtest_png=@imgtest.png'

First part (main content) from the curl request example:

<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
<atom:id>urn:nase:040366</atom:id>
<atom:content type="application/vnd.haufegroup.chsingledocingest+xml">
<ch:haufe-document xmlns:am="http://idesk.haufe-lexware.com/document-meta"
    xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document"
    xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
<ch:meta>
<ch:img width="400" locator="imgtest.png" name="imgtest_png" type="image/png"></ch:img>
<ch:quickSearchPhrase>HI131121001</ch:quickSearchPhrase>
<ch:publisher>HAUFE</ch:publisher>
<ch:title>blobtest</ch:title>
<ch:title name="compoundTitle">tabellentest</ch:title>
<ch:appDocId>HI131121001</ch:appDocId>
<ch:application>portals</ch:application>
<ch:contentHubId>contenthub://portals/HI131121001</ch:contentHubId>
<ch:documentType>BEITRAG</ch:documentType>
<am:language>de</am:language>
<am:rootId>HI131121001</am:rootId>
<am:documentClassification>BEITRAG_BEITRAG</am:documentClassification>
<am:documentType>BEITRAG</am:documentType>
<am:documentCategory>BEITRAG</am:documentCategory>
<am:outlinePath>HI131121001/</am:outlinePath>
<am:isRoot>true</am:isRoot>
</ch:meta>
<chb:preview xmlns="http://www.w3.org/1999/xhtml" autogenerated="false">
</chb:preview>
<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">blobtest</chb:baselineContent>
<ch:nativeContent/>
</ch:haufe-document>
</atom:content>
</atom:entry>
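The "no dangling parts/elements" rule lends itself to a pre-flight check. In the curl example the form field imgtest_png carries the same value as the name attribute of the img element, so this sketch assumes that parts are matched to blob and img elements by that name; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def dangling_attachments(haufe_document_xml, part_names):
    """Return (elements without a part, parts without an element)."""
    root = ET.fromstring(haufe_document_xml)
    meta = root.find(f"{{{CH_NS}}}meta")
    declared = set()
    if meta is not None:
        for tag in ("blob", "img"):
            for element in meta.findall(f"{{{CH_NS}}}{tag}"):
                name = element.get("name")
                if name is not None:
                    declared.add(name)
    supplied = set(part_names)
    return sorted(declared - supplied), sorted(supplied - declared)
```

Both returned lists should be empty before the multipart request is sent.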

Bulk-Ingest-Archive

In order to perform a bulk-ingest one has to provide a bulk-ingest-archive. This is a ZIP-file containing:

  • one or many documents
  • zero or many blobs
  • exactly one manifest

Parts of a Bulk-Ingest-Archive

Document

Each document must be a valid contenthub-haufe-document. In its meta-data it must be assigned to the application that corresponds to the workspace. It must have an appDocId that is unique within the workspace, and it must refer to all blobs belonging to the document. Here is an example of what a minimal document may look like:

<?xml version="1.0" encoding="utf-8"?>
<ch:haufe-document
        xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document"
        xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
    <ch:meta>
        <ch:title>Example Document</ch:title>
        <ch:application>apidoc</ch:application>
        <ch:appDocId>ED1234</ch:appDocId>
        <ch:contentHubId>contenthub://apidoc/ED1234</ch:contentHubId>
        <ch:documentType>EXAMPLE</ch:documentType>
        <ch:img type="application/png" name="img1" locator="img1.png" width="300"/>
        <ch:img type="application/png" name="img2" locator="img2.png" width="100"/>
    </ch:meta>
    <chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">empty content</chb:baselineContent>
    <ch:nativeContent></ch:nativeContent>
</ch:haufe-document>

In this example:

  1. application is the application_id this document belongs to. It must match the application_id of the later bulk-ingest-request.
  2. appDocId is the document's id within the application's workspace. It must be unique within this workspace.
  3. contentHubId is derived from application and appDocId to construct an identifier that is globally unique within contenthub over all workspaces and applications.
  4. <ch:img type="application/png" name="img1" locator="img1.png"/> is a reference to a blob, here an image with the name img1 and the locator img1.png (a filename). The name must be unique among all blobs the document refers to, and so must the locator. Note that a blob-name can also be used to construct a contentHub-globally unique identifier: contenthub://apidoc/ED1234#img1. The name of a blob will never be changed. The locator, on the other hand, is subject to change. It must be valid in the current storage context, meaning:
    1. for a document stored in a bulk-ingest-archive, the locator is a simple filename;
    2. when the document is stored in contentHub's database, the locator will be replaced with the location where the blob is stored in the database;
    3. when the document is delivered in a multipart-retrieval-request, the locator might be a simple filename again;
    4. when the document is delivered in an export-archive, the locator is set to the absolute location of the blob-file within the archive.
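The identifier rules above are plain string constructions, as this sketch shows. The helper names are made up for illustration; the formats are the ones used in the example document.

```python
def content_hub_id(application, app_doc_id):
    """contentHubId as derived from application and appDocId."""
    return f"contenthub://{application}/{app_doc_id}"


def blob_id(application, app_doc_id, blob_name):
    """Globally unique blob identifier built from the blob name."""
    return f"{content_hub_id(application, app_doc_id)}#{blob_name}"


def duplicates(values):
    """Values that occur more than once, e.g. among blob names or locators."""
    seen, repeated = set(), set()
    for value in values:
        if value in seen:
            repeated.add(value)
        seen.add(value)
    return sorted(repeated)
```

Running duplicates over all blob names of a document, and separately over all locators, covers the two uniqueness requirements.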

Blob

A blob is binary data without any meta-data. It can be anything: an image, a PDF, or simply a text file. Blobs are not indexed by contentHub, so they will never appear in a search result. They are considered document attachments only.

Manifest

The manifest describes all contents of a bulk-ingest-archive. It must declare all documents and all blobs contained in the archive. A manifest can also declare which documents are to be removed from the workspace when performing the bulk-ingest-job.

ContentHub supports two different manifest-schemas. Each schema dictates how the bulk-ingest-archive is built. When unsure which schema to choose, pick the latest one. Older schemas are subject to deprecation and removal.

Manifest-Schema 2022-11-08

Bulk-Ingest-Manifest-20221108.xsd (in Haufe-network only)

Bulk-Ingest-Manifest-20221108.xsd (from service)

The 2022-11-08 schema allows document- and blob-files to be placed at any location within the archive. All paths given in the manifest are considered as absolute paths within the archive. An archive may have the following internal structure:

META-INF/manifest.xml
somewhere/ED1234.xml
anywhere/img1.png
anywhereelse/img2.png

The manifest for this archive should look like this:

<manifest archive-id="example-1"
          production-time="2022-11-17T10:00:00"
          application-id="apidoc"
          xmlns="http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08">
    <number-of-entries>2</number-of-entries>
    <number-of-blobs>2</number-of-blobs>
    <entries>
        <updated-entry>
            <path>somewhere/ED1234.xml</path>
            <blobs>
                <blob><path>anywhere/img1.png</path><name>img1</name></blob>
                <blob><path>anywhereelse/img2.png</path><name>img2</name></blob>
            </blobs>
        </updated-entry>
        <removed-entry><path>ED5566</path></removed-entry>
    </entries>
</manifest>

This manifest describes the update of document ED1234 and the removal of document ED5566, so number-of-entries is 2. Since ED1234 has two blobs, the manifest has to associate them with the document. This is done via the name element of each blob, whose value has to match the name of the blob in the document's meta-data section. Finally, number-of-blobs must match the number of blobs declared in this manifest.
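The two counters can be verified before the archive is packed. A minimal sketch for the 2022-11-08 schema, assuming the element layout of the example manifest; the function name is hypothetical and a missing counter is treated as 0 in this example.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08"}


def manifest_count_problems(manifest_xml):
    """Compare number-of-entries and number-of-blobs with the declared content."""
    root = ET.fromstring(manifest_xml)
    updated = root.findall("m:entries/m:updated-entry", NS)
    removed = root.findall("m:entries/m:removed-entry", NS)
    blobs = root.findall("m:entries/m:updated-entry/m:blobs/m:blob", NS)
    stated_entries = int(root.findtext("m:number-of-entries", default="0", namespaces=NS))
    stated_blobs = int(root.findtext("m:number-of-blobs", default="0", namespaces=NS))

    problems = []
    actual_entries = len(updated) + len(removed)  # updates and removals both count
    if stated_entries != actual_entries:
        problems.append(f"number-of-entries is {stated_entries}, found {actual_entries}")
    if stated_blobs != len(blobs):
        problems.append(f"number-of-blobs is {stated_blobs}, found {len(blobs)}")
    return problems
```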

Referencing a blob by locator

For backwards compatibility with the 2018-09-28-schema, the 2022-11-08-schema also allows a blob to be referenced by the locator declared in the document's meta-data instead of by its name. For the same document ED1234 shown above, a manifest using locator-references instead of name-references may look like this:

<manifest archive-id="example-1"
          production-time="2022-11-17T10:00:00"
          application-id="apidoc"
          xmlns="http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08">
    <number-of-entries>2</number-of-entries>
    <number-of-blobs>2</number-of-blobs>
    <entries>
        <updated-entry>
            <path>somewhere/ED1234.xml</path>
            <blobs>
                <blob><path>anywhere/img1.png</path><locator>img1.png</locator></blob>
                <blob><path>anywhereelse/img2.png</path><locator>img2.png</locator></blob>
            </blobs>
        </updated-entry>
        <removed-entry><appDocId>ED5566</appDocId></removed-entry>
    </entries>
</manifest>

Note that only one reference is allowed per blob. Preferably use the name-reference. Declaring both references on a blob in the manifest will lead to a validation error.
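A producer can catch the double-reference case locally. This sketch flags only what the note above describes, namely a blob that declares both a name and a locator; whether a blob with neither reference is accepted is not covered here. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08"}


def doubly_referenced_blobs(manifest_xml):
    """Paths of blobs that declare both <name> and <locator>."""
    root = ET.fromstring(manifest_xml)
    offending = []
    for blob in root.iterfind(".//m:blobs/m:blob", NS):
        has_name = blob.find("m:name", NS) is not None
        has_locator = blob.find("m:locator", NS) is not None
        if has_name and has_locator:
            offending.append(blob.findtext("m:path", default="", namespaces=NS))
    return offending
```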

Manifest-Schema 2018-09-28

Bulk-Ingest-Manifest-20180928.xsd (in Haufe-network only)

Bulk-Ingest-Manifest-20180928.xsd (from service)

The 2018-09-28 schema introduced the support for blobs. Concerning paths it is much more restricted than the 2022-11-08-schema: Documents must be placed under a path starting with documents/. The path below documents/ can be chosen freely. The path to a blob-file is restricted. It must follow the scheme:

blobs/{same-path-as-document}/{document-filename}/{blob-filename}

In the 2018-09-28-schema, the same archive as above looks as follows:

META-INF/manifest.xml
documents/somewhere/ED1234.xml
blobs/somewhere/ED1234.xml/img1.png
blobs/somewhere/ED1234.xml/img2.png

The manifest for this bulk-ingest-archive looks like this:

<manifest archive-id="example-1"
          production-time="2018-10-26T10:00:00"
          application-id="apidoc"
          xmlns="http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2018-09-28">
    <number-of-entries>2</number-of-entries>
    <number-of-blobs>2</number-of-blobs>
    <entries>
        <updated-entry>
            <path>somewhere/ED1234.xml</path>
            <blobs>
                <blob><path>img1.png</path></blob>
                <blob><path>img2.png</path></blob>
            </blobs>
        </updated-entry>
        <removed-entry>
            <appDocId>ED5566</appDocId>
        </removed-entry>
    </entries>
</manifest>
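
In the 2018-09-28 layout, the archive locations follow from the values in the manifest. The sketch below applies the path scheme given above to the example; the function names are hypothetical, and it assumes that the document path in the manifest is relative to documents/ and the blob path is the bare blob filename, as in the example.

```python
def document_archive_path_2018(document_path):
    """Archive location of a document, e.g. 'somewhere/ED1234.xml'."""
    return f"documents/{document_path}"


def blob_archive_path_2018(document_path, blob_filename):
    """blobs/{same-path-as-document}/{document-filename}/{blob-filename}"""
    return f"blobs/{document_path}/{blob_filename}"
```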