MergeDocumentElements
Description
Given a FlowFile that contains a full Document and one or more FlowFiles that contain additional data to merge into the Document, this Processor will merge the additional data into the Document. This can be used, for instance, when a table or image has been extracted from a Document and analyzed with a deep learning model in order to glean insights. The derived information can then be merged back into the original Document using this Processor. For each FlowFile that does not contain the full Document, the Processor does one of two things. If the 'Content Metadata Key' property is set, it adds the contents of the FlowFile to the metadata of the Container that the FlowFile belongs to. Otherwise, it creates a Processing Element Representation whose 'data' element is the contents of the FlowFile.
Tags
assemble, combine, document, element, fragment, join, merge, rag, retrieval augmented generation, unstructured
Properties
In the table below, required properties are marked with an asterisk (*). All other properties are optional. The table also indicates any default values and whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Allowable Values | Description |
---|---|---|---|---|
Timeout * | Timeout | 5 minutes | | The amount of time to wait for all document fragments to arrive before merging the documents. |
Content Metadata Key | Content Metadata Key | | | The key to use for the metadata entry that will contain the content of the FlowFile. If this property is set, the content of each of the FlowFiles will be placed into the Document Container's metadata with the specified key. If not specified, the content of the FlowFile will be added as a Processing Element Representation in the document. Supports Expression Language, using FlowFile attributes and Environment variables. |
Character Set * | Character Set | UTF-8 | | The Character Set of all FlowFiles' contents. All FlowFiles that are included must have the same Character Set. If any FlowFile has binary content, the FlowFile's contents must first be Base64 encoded. In this case, it is recommended to include a metadata entry named 'encoding' with a value of 'base64'. Supports Expression Language, using FlowFile attributes and Environment variables. |
FlowFile Inclusion Filter | FlowFile Inclusion Filter | | | An Expression Language expression that is evaluated against each incoming FlowFile. If the result of the expression is true, the FlowFile will be included in the bin; otherwise, it will be ignored. When a FlowFile is split up and later merged, the Processor must wait for all segments of the original FlowFile to arrive in order to merge them together. This property allows you to specify a filter that excludes some FlowFiles from the merged document, while still routing those FlowFiles to the Processor in order to ensure that all segments of the FlowFile arrive. Supports Expression Language, using FlowFile attributes and Environment variables. |
Maximum number of Bins * | Maximum number of Bins | 1000 | | Specifies the maximum number of bins that can be held in memory at any one time. |
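The 'Character Set' property requires binary content to be Base64 encoded before it reaches this Processor. The following is a minimal Python sketch of that preparation step, not part of the Processor itself. The function name `encode_binary_fragment` and the dictionary layout are hypothetical and used only for illustration; only the 'encoding' / 'base64' metadata entry comes from the property description above.

```python
import base64


def encode_binary_fragment(raw: bytes) -> dict:
    """Base64-encode binary content so it can be carried as text.

    The returned layout is hypothetical; it simply pairs the encoded
    text with the recommended 'encoding' metadata entry.
    """
    text = base64.b64encode(raw).decode("ascii")
    return {"content": text, "metadata": {"encoding": "base64"}}


# Example: the first eight bytes of a PNG file stand in for binary content.
fragment = encode_binary_fragment(b"\x89PNG\r\n\x1a\n")
print(fragment["metadata"])
```

Because Base64 output is plain ASCII, the encoded text is identical under UTF-8 and other ASCII-compatible character sets, so it can be merged alongside ordinary text fragments.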
Dynamic Properties
Name | Value | Description |
---|---|---|
User-defined metadata key | User-defined metadata value | Adds a metadata entry to the Document. If the 'Content Metadata Key' property is set, the metadata of the Container will be updated. Otherwise, the entry will be added to the metadata of the Processing Element's Representation that is created for the FlowFile. Supports Expression Language: Yes, evaluated using FlowFile Attributes and Environment variables. |
Relationships
Name | Description |
---|---|
failure | If unable to merge the document elements, the original document fragments are routed to this relationship. |
merged | The merged document is routed to this relationship when all document fragments have been merged together. |
partial | If only some of the document fragments arrive within the timeout period, those that have arrived are merged and routed to this relationship. |
Reads Attributes
Name | Description |
---|---|
container.id | The ID of the Container that the FlowFile belongs to. This is expected to be on all FlowFiles except for the FlowFile containing the full Document |
container.scope | The scope of the Container that the FlowFile belongs to. Exactly one FlowFile in each bin must have a container.scope of 'DOCUMENT' |
document.id | The ID of the Document that all FlowFiles belong to. Each FlowFile that has the same value for this attribute will be placed together into the same "bin" and will be merged together. This attribute is expected to be on all FlowFiles. |
fragment.count | The number of FlowFiles that are expected to be merged together for the Document, including the Document FlowFile itself. This is expected to be on all FlowFiles. |
fragment.index | The index of the fragment that the FlowFile represents. This is expected to be on all FlowFiles except for the FlowFile containing the full Document. |
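The attributes above drive how FlowFiles are grouped. The following Python sketch illustrates the rules as described in this table; it is not the Processor's actual implementation. FlowFiles are modeled as plain attribute dictionaries, and the function name `bin_flowfiles`, the 'TABLE' scope value, the IDs, and the index values are all assumptions made for illustration.

```python
from collections import defaultdict


def bin_flowfiles(flowfiles):
    """Group attribute dicts by document.id and summarize each bin."""
    bins = defaultdict(list)
    for attrs in flowfiles:
        bins[attrs["document.id"]].append(attrs)

    status = {}
    for doc_id, members in bins.items():
        # fragment.count includes the Document FlowFile itself.
        expected = int(members[0]["fragment.count"])
        documents = [m for m in members if m.get("container.scope") == "DOCUMENT"]
        status[doc_id] = {
            "received": len(members),
            "expected": expected,
            "complete": len(members) == expected,
            # Exactly one FlowFile per bin must carry the DOCUMENT scope.
            "has_one_document": len(documents) == 1,
        }
    return status


# Hypothetical attribute values, for illustration only.
example = [
    {"document.id": "doc-1", "fragment.count": "3", "container.scope": "DOCUMENT"},
    {"document.id": "doc-1", "fragment.count": "3", "container.scope": "TABLE",
     "container.id": "t-1", "fragment.index": "1"},
    {"document.id": "doc-1", "fragment.count": "3", "container.scope": "TABLE",
     "container.id": "t-2", "fragment.index": "2"},
    {"document.id": "doc-2", "fragment.count": "2", "container.scope": "DOCUMENT"},
]
status = bin_flowfiles(example)
print(status)
```

In this example the bin for 'doc-1' has received all three expected FlowFiles, while the bin for 'doc-2' is still waiting for one more fragment.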
Writes Attributes
Name | Description |
---|---|
eviction.explanation | A more user-friendly explanation as to why the bin was evicted. |
eviction.reason | The reason that the bin was evicted, i.e., why the Processor determined it was time to merge the document and fragments together. The value is 'MAX_ENTRIES_THRESHOLD_REACHED' if all of the document elements were received, 'TIMEOUT' if the timeout period was reached before all document elements arrived, or 'BIN_MANAGER_FULL' if the bin was merged because the number of bins reached the maximum allowed by the 'Maximum number of Bins' property. |
mime.type | The MIME type will be set to application/json. |
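A downstream consumer can use the 'eviction.reason' attribute to judge whether a merged document contains every element. The Python sketch below shows one way this could be done, assuming the attributes are available as a dictionary. The function name `describe_merge_outcome` and the returned labels are hypothetical. Treating 'BIN_MANAGER_FULL' as "possibly incomplete" is also an assumption, since the table does not state how many fragments had arrived in that case.

```python
def describe_merge_outcome(attributes: dict) -> str:
    """Map the eviction.reason attribute to a coarse completeness label."""
    reason = attributes.get("eviction.reason")
    if reason == "MAX_ENTRIES_THRESHOLD_REACHED":
        # All document elements were received before the merge.
        return "complete"
    if reason == "TIMEOUT":
        # The timeout elapsed before all elements arrived.
        return "incomplete"
    if reason == "BIN_MANAGER_FULL":
        # Assumption: the bin was evicted early to free space.
        return "possibly incomplete"
    return "unknown"


outcome = describe_merge_outcome({
    "eviction.reason": "TIMEOUT",
    "mime.type": "application/json",
})
print(outcome)
```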
State Management
This component does not store state.
Restricted
This component is not restricted.
Input Requirement
This component requires an incoming relationship.
Example Use Cases Involving Other Components
Multiprocessor Use Case 1
Parse a PDF into a Document object, using OpenAI to summarize any Table that is found in the document.
Components Involved
- ParsePdfDocument
  - Set the 'Element Detection Service URL' property to the URL of the Element Detection Service, such as http://document-element-detection-service
  - Set the 'OCR Service URL' property to the URL of the OCR Service, such as http://ocr-service
  - Set the 'Table Embedding Strategy' property to 'SKIP'
  - Connect the 'success' relationship to the 'MergeDocumentElements' Processor
  - Auto-terminate the 'images' relationship
  - Connect the 'tables' relationship to the 'PromptOpenAI' Processor
- PromptOpenAI
  - Set the 'Web Client Service' property to point to a Web Client Service that has been configured with a sufficiently long timeout - 30 seconds is recommended.
  - Set the 'OpenAI API Key' to the value of your OpenAI API Key; it is advised to use a Parameter for this property.
  - Set 'Prompt Type' to 'Image'
  - Set the 'Image Model Name' to the appropriate model for your use case. For example, 'gpt-4o'
  - Set the 'System Message' property to 'You are an assistant that is able to look at an image of a table and summarize the data in the table.'
  - Set the 'User Message' property to 'Summarize the data in the table.'
  - Set the 'Image MIME Type' property to '${mime.type}'
  - Leave the 'Image URL' property unset.
  - Set the 'Temperature' property to '0.0'
  - Connect the 'success' relationship to the 'MergeDocumentElements' Processor
- MergeDocumentElements
  - Set the 'Timeout' property to '5 minutes'.
  - Set the 'Character Set' property to 'UTF-8'.
  - Leave the 'Content Metadata Key' property unset.
  - Leave the 'FlowFile Inclusion Filter' property unset.
  - Connect the 'merged' relationship to the appropriate next Processor in the flow.
  - If desirable, connect the 'partial' relationship to the appropriate next Processor in the flow.
System Resource Considerations
This component does not specify system resource considerations.