This API lets you feed documents into the ContentHub.
Note: This API only supports the Client Credentials flow.
This means that you log in using only your application's client ID and client secret, without actually authenticating a user. In effect, any /token endpoint of the Authentication Method(s) below can be used for that.
Log in using a local username and password.
Token Endpoint
This API supports the following OAuth2 authorization flow: Client Credentials.
The Single Document Ingest is intended for content producers that publish one document at a time.
After that, the admin at Haufe-Lexware will choose a scope for your use case and inform you about it. You can also view the assigned scopes at any time: open the application under the "Applications" tab, then press the "Select" button to choose the subscription whose scopes you want to see.
Now you are ready to use the API. There are two important URLs: the token endpoint URL and the API URL.
You can find both at the top of this page; the token endpoint URL is shown when you unfold the section with the greenish background (entitled "Username and Password (local)").
Now you need to use this command to request a token:
curl --location --request POST '<token-endpoint-url>' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'client_id=...' \
--data-urlencode 'client_secret=...' \
--data-urlencode 'scope=...'
The response to this request is a small JSON object that contains the token in the access_token property.
All ingest requests have to contain the token in the Authorization header, like this:
curl --location --request POST '<api-url>/ingest/v1/workspaces/<workspace_name>/content' \
--header 'Authorization: Bearer <access-token>'
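The token request and the Authorization header shown above can also be sketched in Python. This is a minimal sketch using only the standard library; the function names build_token_request and bearer_header are hypothetical, and the token endpoint URL, client ID, client secret and scope have to be filled in with your own values.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, scope):
    """Build (but do not send) the client-credentials token request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def bearer_header(token_response_body):
    """Read access_token from the JSON response and format the header."""
    token = json.loads(token_response_body)["access_token"]
    return {"Authorization": "Bearer " + token}


# Sending the request performs a network call, so it is only outlined here:
# with urllib.request.urlopen(build_token_request(url, cid, secret, scope)) as resp:
#     headers = bearer_header(resp.read())
```

The returned dictionary can be passed as headers to any later ingest request.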
Inside ContentHub the documents are grouped into workspaces. It is natural to use one workspace per content producing application, so before a new content producing application can start ingesting documents, it has to create the workspace. This step will be done only once, when the content producing application starts ingesting documents for the first time (or if the workspace is deleted and it has to be recreated). The name of the workspace has to be in line with the scope that was provided by the ContentHub administrator.
The scope follows the pattern application_<workspace_name>, e.g. application_demoWorkspace, where demoWorkspace is the name of the workspace that has to be created.
curl --location --request POST '<api-url>/ingest/v1/workspaces?applicationId=<workspace_name>' \
--header 'Authorization: Bearer <access-token>'
curl --location --request DELETE '<api-url>/ingest/v1/workspaces/<workspace_name>' \
--header 'Authorization: Bearer <access-token>'
curl --location --request GET '<api-url>/ingest/v1/workspaces' \
--header 'Authorization: Bearer <access-token>'
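The three workspace calls above can be wrapped in one small helper. This is a sketch, not an official client: the function name workspace_request and the action labels "create", "delete" and "list" are hypothetical, and mapping them to POST, DELETE and GET is an interpretation of the curl commands.

```python
import urllib.parse
import urllib.request


def workspace_request(api_url, access_token, action, workspace_name=None):
    """Build (but do not send) a workspace request for the given action."""
    base = api_url.rstrip("/") + "/ingest/v1/workspaces"
    if action in ("create", "delete") and not workspace_name:
        raise ValueError("workspace_name is required for " + action)
    if action == "create":
        method = "POST"
        url = base + "?" + urllib.parse.urlencode({"applicationId": workspace_name})
    elif action == "delete":
        method = "DELETE"
        url = base + "/" + urllib.parse.quote(workspace_name, safe="")
    elif action == "list":
        method, url = "GET", base
    else:
        raise ValueError("unknown action: " + action)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": "Bearer " + access_token}
    )
```

Passing the returned object to urllib.request.urlopen sends the request.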
An ingest document represents an individually addressable unit of content together with the associated metadata. The ingest requires an XML document that encodes the metadata and transports the content.
Every ingest XML document follows this template:
<haufe-document xmlns="http://contenthub.haufe-lexware.com/haufe-document">
<!-- The meta element holds the document's metadata, cf. the ingest metadata description -->
<meta>
<!-- ... -->
</meta>
<!--
The following elements describe the document's textual content.
(Actually, the textual content appears twice - once in baseline and once in native format.)
-->
<preview>
<!-- ... -->
</preview>
<baselineSearchableText>
<!-- ... -->
</baselineSearchableText>
<baselineContent>
<!-- ... -->
</baselineContent>
<nativeSearchableText>
<!-- ... -->
</nativeSearchableText>
<nativeContent>
<!-- ... -->
</nativeContent>
<!-- possibly future extension elements, for now ignored -->
</haufe-document>
Every document ingested into the content hub needs to have metadata. There are (very few) mandatory metadata fields and a number of optional metadata fields whose semantics are defined by the content hub. In addition, content producers are free to add custom metadata fields as long as they place the respective elements into a custom namespace. This means that a content producer is supposed to use the pre-defined optional fields whenever a document has metadata that matches the respective field's semantics, so that consumers can refer to a uniform set of base metadata. If, on the other hand, the semantics of metadata available to the producer do not match the semantics of any optional field, the producer is supposed to place the metadata into a custom field.
The elements baselineContent, baselineSearchableText, nativeContent, and nativeSearchableText represent the textual content of the document: baselineContent and nativeContent carry it with markup, baselineSearchableText and nativeSearchableText as plain text.
The baseline format serves consumers that are not interested in an elaborate, domain-driven (and thus producer-specific) markup, but that need to display content from potentially many producers to the end user with as little hassle as possible. The format therefore has to be close to HTML. Since content in baseline format is used inside content hub documents, which are XML documents, it should be an XML format as well. Together that makes a format derived from XHTML the best fit.
The preview field should contain a teaser from the baseline content. The preview element therefore shares the XML schema of the baseline format, including its XHTML modules, attributes, and elements. The preview can be provided by the content producer. If it is not provided, the ingest service generates one by taking a teaser of 500 characters from the baseline content.
There will also be consumers that "understand" and need the elaborate, domain-driven content markup of specific applications; for these consumers, the baseline format is insufficient.
The content hub therefore expects content producers to also submit their content in an application (or even document-type) specific format - the so-called native format - that transports as much of the domain-specific markup as the application is willing to share.
The content hub makes no attempt to interpret or analyze the content in native format. It merely stores and forwards the native content. The only assumption is that the native content - if present - is represented as a single well-formed XML document. (In particular, the native content MUST have a single root element, so the nativeContent element is either empty or includes a single child element.)
The (optional) elements baselineSearchableText and nativeSearchableText are used to submit text fragments that are to be indexed in addition to the text content of the baselineContent and nativeContent elements, respectively. The searchable text elements may hold any mixed content.
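The single-root rule for native content stated above can be checked on the producer side before ingesting. The following is a sketch under assumptions: the function name native_content_ok is hypothetical, and it only counts child elements of nativeContent, it does not validate the document against any schema.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def native_content_ok(document_xml):
    """Return True if nativeContent is absent, empty, or has one child element."""
    root = ET.fromstring(document_xml)
    native = root.find("{%s}nativeContent" % CH_NS)
    if native is None:
        return True
    # More than one child element would mean more than one root element.
    return len(list(native)) <= 1
```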
curl --location --request POST '<api-url>/ingest/v1/workspaces/demoWorkspace/content' \
--header 'Authorization: Bearer <access-token>' \
--header 'Content-Type: application/atom+xml' \
--data @contentHubDocument.xml
<entry xmlns="http://www.w3.org/2005/Atom">
<content type="application/vnd.haufegroup.chsingledocingest+xml">
<ch:haufe-document xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document">
<ch:meta>
<ch:chronologicalSortDate>2022-09-27T00:15:00+02:00</ch:chronologicalSortDate>
<ch:visible>true</ch:visible>
<ch:contentLanguage>de</ch:contentLanguage>
<ch:tag>Steuern</ch:tag>
<ch:tag>Gesetzgebung</ch:tag>
<ch:title name="default">Kabinett beschließt Jahressteuergesetz 2022</ch:title>
<ch:canonicalUrl>https://www.haufe.de/personal/entgelt/jahressteuergesetz-2022_78_575614.html</ch:canonicalUrl>
<ch:appDocId>57561444</ch:appDocId>
<ch:application>demoWorkspace</ch:application>
<ch:contentHubId>contenthub://demoWorkspace/57561444</ch:contentHubId>
<ch:documentType>News</ch:documentType>
<pr:rightsHolder xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata"/>
<pr:isNotVisibleInRecommendationBox xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">false</pr:isNotVisibleInRecommendationBox>
<pr:feederstate xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">SUCCESS</pr:feederstate>
<pr:overline xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">Jahressteuergesetz</pr:overline>
<pr:category xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">Personal</pr:category>
<pr:subcategory xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata">Entgelt</pr:subcategory>
<pr:feedertime xmlns:pr="http://www.haufe.de/namespaces/portals/contenthub/metadata" xmlns:xs="http://www.w3.org/2001/XMLSchema" xs:type="dateTime">2022-09-28T09:27:09.579112+02:00</pr:feedertime>
</ch:meta>
<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml" xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
<div>
<div class="chb-text">
<p>Das Jahressteuergesetz 2022 soll Entlastungen bringen und wird voraussichtlich
<strong>Ende Oktober in den Bundestag</strong> kommen. Aus lohnsteuerlicher Sicht ist vor allem auf folgende Punkte hinzuweisen:
</p>
<h2>Rentenbeiträge voll absetzbar</h2>
<p>Der vollständige Abzug von
<strong>Altersvorsorgeaufwendungen </strong>als Sonderausgaben soll bereits ab dem Jahr 2023 (statt erstmals im Jahr 2025) möglich sein. Das hat auch Auswirkungen auf die Berücksichtigung im Rahmen der sogenannten Vorsorgepauschale im Lohnsteuerabzugsverfahren.
</p>
</div>
</div>
</chb:baselineContent>
<ch:nativeContent>
<noNativeContent/>
</ch:nativeContent>
</ch:haufe-document>
</content>
</entry>
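As the example above shows, the haufe-document travels inside the content element of an Atom entry. A producer that already has the haufe-document as a string could wrap it as sketched below. The function name wrap_in_atom_entry is hypothetical, and the prefixes registered here are only a cosmetic choice; the serializer may assign generated prefixes to other namespaces.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"
INGEST_TYPE = "application/vnd.haufegroup.chsingledocingest+xml"

# Prefixes used when serializing; this changes module-wide ElementTree state.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("ch", CH_NS)


def wrap_in_atom_entry(haufe_document_xml):
    """Wrap a haufe-document string in an Atom entry and return UTF-8 bytes."""
    doc = ET.fromstring(haufe_document_xml)
    entry = ET.Element("{%s}entry" % ATOM_NS)
    content = ET.SubElement(entry, "{%s}content" % ATOM_NS, {"type": INGEST_TYPE})
    content.append(doc)
    return ET.tostring(entry, encoding="utf-8")
```

The resulting bytes can be sent as the request body with the Content-Type application/atom+xml, as in the curl command above.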
Documents regularly contain images (e.g. jpg or png) and other binaries like MS Word or pdf documents. Moreover they are often packaged together as a composite document.
Composite documents are uploaded as multipart documents with content type multipart/* (e.g. multipart/form-data or multipart/mixed). The first part consists of a valid haufe-document Atom entry (the main content) with content type application/atom+xml. The main content contains blob and/or img elements in its meta section. Each additional part consists of binary content and corresponds to one of the main content's blob or img elements.
Main content and all attachments are written to the database. Attachments are stored in a way that allows discovery of their relationship to the parent main content document. As a simple rule, for every attachment there must be a corresponding blob or img element; conversely, for every blob or img element there must be a corresponding attachment (no dangling parts/elements). Initially, composite documents can be loaded via the single doc ingest API.
Here is an example of some img and blob elements (from the meta section):
<ch:haufe-document
xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document"
xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
<ch:meta>
<ch:blob locator="blobtest.docx" name="blobtest_docx" type="application/docx"></ch:blob>
<ch:img width="101" height="110" type="image/png" locator="imgtest.png" name="imgtest_png"></ch:img>
</ch:meta>
<chb:preview xmlns="http://www.w3.org/1999/xhtml" autogenerated="false">
</chb:preview>
<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">blobtest</chb:baselineContent>
<ch:nativeContent/>
</ch:haufe-document>
curl --location --request POST '<api-url>/ingest/v1/workspaces/demoWorkspace/content' \
--header 'Authorization: Bearer <access-token>' \
--header 'Content-Type: multipart/form-data' \
--form 'atom=@multipartxml.xml;type=application/atom+xml' \
--form 'imgtest_png=@imgtest.png'
First part (main content) from the curl request example:
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
<atom:id>urn:nase:040366</atom:id>
<atom:content type="application/vnd.haufegroup.chsingledocingest+xml">
<ch:haufe-document xmlns:am="http://idesk.haufe-lexware.com/document-meta"
xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document"
xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
<ch:meta>
<ch:img width="400" locator="imgtest.png" name="imgtest_png" type="image/png"></ch:img>
<ch:quickSearchPhrase>HI131121001</ch:quickSearchPhrase>
<ch:publisher>HAUFE</ch:publisher>
<ch:title>blobtest</ch:title>
<ch:title name="compoundTitle">tabellentest</ch:title>
<ch:appDocId>HI131121001</ch:appDocId>
<ch:application>portals</ch:application>
<ch:contentHubId>contenthub://portals/HI131121001</ch:contentHubId>
<ch:documentType>BEITRAG</ch:documentType>
<am:language>de</am:language>
<am:rootId>HI131121001</am:rootId>
<am:documentClassification>BEITRAG_BEITRAG</am:documentClassification>
<am:documentType>BEITRAG</am:documentType>
<am:documentCategory>BEITRAG</am:documentCategory>
<am:outlinePath>HI131121001/</am:outlinePath>
<am:isRoot>true</am:isRoot>
</ch:meta>
<chb:preview xmlns="http://www.w3.org/1999/xhtml" autogenerated="false">
</chb:preview>
<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">blobtest</chb:baselineContent>
<ch:nativeContent/>
</ch:haufe-document>
</atom:content>
</atom:entry>
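The "no dangling parts/elements" rule for composite documents can be checked before uploading. The sketch below assumes that an attachment part is matched to its blob or img element through the name attribute, as the form field imgtest_png in the curl example suggests; this matching rule is an assumption, and the function name find_dangling is hypothetical.

```python
import xml.etree.ElementTree as ET

CH_NS = "http://contenthub.haufe-lexware.com/haufe-document"


def find_dangling(main_content_xml, part_names):
    """Compare blob/img names in the meta section with attachment part names.

    part_names lists the attachment parts only, not the main content part.
    Returns (elements_without_part, parts_without_element) as sorted lists.
    """
    root = ET.fromstring(main_content_xml)
    wanted = {"{%s}blob" % CH_NS, "{%s}img" % CH_NS}
    declared = set()
    for meta in root.iter("{%s}meta" % CH_NS):
        for child in meta:
            if child.tag in wanted and child.get("name"):
                declared.add(child.get("name"))
    uploaded = set(part_names)
    return sorted(declared - uploaded), sorted(uploaded - declared)
```

Both returned lists are empty when every element has a part and every part has an element.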
To perform a bulk ingest, you have to provide a bulk-ingest-archive. This is a ZIP file containing documents, blobs, and a manifest.
Each document must be a valid contenthub haufe-document. In its metadata it must be assigned to the application that corresponds to the workspace. It must have an appDocId that is unique within the workspace. And it must refer to all blobs that belong to it. Here is an example of what a minimal document may look like:
<?xml version="1.0" encoding="utf-8"?>
<ch:haufe-document
xmlns:ch="http://contenthub.haufe-lexware.com/haufe-document"
xmlns:chb="http://contenthub.haufe-lexware.com/baseline-format-schema">
<ch:meta>
<ch:title>Example Document</ch:title>
<ch:application>apidoc</ch:application>
<ch:appDocId>ED1234</ch:appDocId>
<ch:contentHubId>contenthub://apidoc/ED1234</ch:contentHubId>
<ch:documentType>EXAMPLE</ch:documentType>
<ch:img type="application/png" name="img1" locator="img1.png" width="300"/>
<ch:img type="application/png" name="img2" locator="img2.png" width="100"/>
</ch:meta>
<chb:baselineContent xmlns="http://www.w3.org/1999/xhtml">empty content</chb:baselineContent>
<ch:nativeContent></ch:nativeContent>
</ch:haufe-document>
application is the application_id this document belongs to. It must match the application_id of the later bulk-ingest-request.
appDocId is the document's id within the application's workspace. It must be unique within this workspace.
contentHubId is derived from application and appDocId to construct an identifier that is globally unique within contenthub across all workspaces and applications.
<ch:img type="application/png" name="img1" locator="img1.png"/> is a reference to a blob, here an image with the name img1 and the locator being the filename img1.png.
The image name must be unique within the document: no two blobs referenced by this document may share a name. The same is true for the locator. Note that a blob name can also be used to construct a contentHub-globally unique identifier: contenthub://apidoc/ED1234#img1
The name of a blob will never be changed. The locator, on the other hand, is subject to change. The locator must be valid in the current storage context.
A blob is binary data without any metadata. It can be anything: an image, a PDF, or simply a text file. Blobs are not indexed by contentHub, so they will never show up in a search result. They are just considered document attachments.
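The identifier rules described above for contentHubId and blob names can be expressed in a few lines. This is a sketch derived from the example values contenthub://apidoc/ED1234 and contenthub://apidoc/ED1234#img1; all function names are hypothetical, and no escaping of special characters is attempted.

```python
def content_hub_id(application, app_doc_id):
    """Build the document identifier from application and appDocId."""
    return "contenthub://%s/%s" % (application, app_doc_id)


def blob_id(application, app_doc_id, blob_name):
    """Build the identifier of a blob from its document and its name."""
    return content_hub_id(application, app_doc_id) + "#" + blob_name


def duplicate_values(values):
    """Return the values that occur more than once, sorted."""
    seen, dupes = set(), set()
    for value in values:
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)
```

Calling duplicate_values once on the blob names and once on the locators of a document reveals violations of the uniqueness rule.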
The manifest describes all contents of a bulk-ingest-archive. It must declare all documents and all blobs contained in the archive. A manifest can also declare which documents are to be removed from the workspace when the bulk-ingest-job is performed.
ContentHub supports two different manifest schemas. Each schema dictates how the bulk-ingest-archive is built. When unsure which schema to choose, pick the latest one. Older schemas are subject to deprecation and removal.
Bulk-Ingest-Manifest-20221108.xsd (in Haufe-network only)
Bulk-Ingest-Manifest-20221108.xsd (from service)
The 2022-11-08 schema allows document- and blob-files to be placed at any location within the archive. All paths given in the manifest are considered as absolute paths within the archive. An archive may have the following internal structure:
META-INF/manifest.xml
somewhere/ED1234.xml
anywhere/img1.png
anywhereelse/img2.png
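An archive with the layout above could be assembled in memory as sketched below. The function name build_archive is hypothetical; the only things taken from the text are the ZIP format and the manifest location META-INF/manifest.xml.

```python
import io
import zipfile


def build_archive(manifest_xml, files):
    """Return the bytes of a ZIP archive.

    manifest_xml is written to META-INF/manifest.xml; files maps archive
    paths (documents and blobs) to their content as str or bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/manifest.xml", manifest_xml)
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()
```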
The manifest for this archive should look like this:
<manifest archive-id="example-1"
production-time="2022-11-17T10:00:00"
application-id="apidoc"
xmlns="http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08">
<number-of-entries>2</number-of-entries>
<number-of-blobs>2</number-of-blobs>
<entries>
<updated-entry>
<path>somewhere/ED1234.xml</path>
<blobs>
<blob><path>anywhere/img1.png</path><name>img1</name></blob>
<blob><path>anywhereelse/img2.png</path><name>img2</name></blob>
</blobs>
</updated-entry>
<removed-entry><path>ED5566</path></removed-entry>
</entries>
</manifest>
This manifest describes the update of document ED1234 and the removal of document ED5566. Thus number-of-entries is 2. Since ED1234 has two blobs, the manifest has to associate them with the document. This is done via the name element of each blob, whose value has to match the name of the blob in the document's metadata section. Finally, number-of-blobs must match the number of blobs declared in this manifest.
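The two counters can be verified before the archive is uploaded. The sketch below assumes, following the explanation above, that updated and removed entries both count as entries; the function name manifest_counts_match is hypothetical, and the check only works on a well-formed manifest in the 2022-11-08 namespace.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08"}


def manifest_counts_match(manifest_xml):
    """Return True if both declared counters match the manifest content."""
    root = ET.fromstring(manifest_xml)
    entries = root.find("m:entries", NS)
    actual_entries = len(list(entries)) if entries is not None else 0
    actual_blobs = len(root.findall(".//m:blobs/m:blob", NS))
    declared_entries = int(root.findtext("m:number-of-entries", namespaces=NS))
    declared_blobs = int(root.findtext("m:number-of-blobs", namespaces=NS))
    return (actual_entries, actual_blobs) == (declared_entries, declared_blobs)
```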
For backwards compatibility with the 2018-09-28 schema, the 2022-11-08 schema also allows referencing a blob by the locator declared in the document's metadata instead of by its name. Using the same document ED1234 shown above, a manifest with locator references instead of name references may look like this:
<manifest archive-id="example-1"
production-time="2022-11-17T10:00:00"
application-id="apidoc"
xmlns="http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2022-11-08">
<number-of-entries>2</number-of-entries>
<number-of-blobs>2</number-of-blobs>
<entries>
<updated-entry>
<path>somewhere/ED1234.xml</path>
<blobs>
<blob><path>anywhere/img1.png</path><locator>img1.png</locator></blob>
<blob><path>anywhereelse/img2.png</path><locator>img2.png</locator></blob>
</blobs>
</updated-entry>
<removed-entry><appDocId>ED5566</appDocId></removed-entry>
</entries>
</manifest>
Note that only one reference is allowed per blob. Preferably use the name reference. Declaring both references on a blob in the manifest will lead to a validation error.
Bulk-Ingest-Manifest-20180928.xsd (in Haufe-network only)
Bulk-Ingest-Manifest-20180928.xsd (from service)
The 2018-09-28 schema introduced support for blobs. Concerning paths, it is much more restricted than the 2022-11-08 schema: documents must be placed under a path starting with documents/. The path below documents/ can be chosen freely. The path to a blob file is restricted. It must follow the scheme:
blobs/{same-path-as-document}/{document-filename}/{blob-filename}
In the 2018-09-28 schema, the same archive as above looks as follows:
META-INF/manifest.xml
documents/somewhere/ED1234.xml
blobs/somewhere/ED1234.xml/img1.png
blobs/somewhere/ED1234.xml/img2.png
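The path scheme of the 2018-09-28 schema can be computed from the document path. This is a sketch that reproduces the listing above; the function name legacy_archive_paths is hypothetical, and document_path is assumed to be given relative to documents/.

```python
import posixpath


def legacy_archive_paths(document_path, blob_filenames):
    """Return (document_archive_path, blob_archive_paths).

    document_path is relative to documents/, e.g. 'somewhere/ED1234.xml'.
    """
    document = posixpath.join("documents", document_path)
    # blobs/{same-path-as-document}/{document-filename}/{blob-filename}
    blobs = [posixpath.join("blobs", document_path, name) for name in blob_filenames]
    return document, blobs
```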
The manifest for this bulk-ingest-archive looks like this:
<manifest archive-id="example-1"
production-time="2018-10-26T10:00:00"
application-id="apidoc"
xmlns="http://contenthub.haufe-lexware.com/bulk-ingest/manifest/2018-09-28">
<number-of-entries>2</number-of-entries>
<number-of-blobs>2</number-of-blobs>
<entries>
<updated-entry>
<path>somewhere/ED1234.xml</path>
<blobs>
<blob><path>img1.png</path></blob>
<blob><path>img2.png</path></blob>
</blobs>
</updated-entry>
<removed-entry>
<appDocId>ED5566</appDocId>
</removed-entry>
</entries>
</manifest>