ChunkDocument

Description

Given an input Datavolo Document, chunks the data into segments that are better suited to LLM synthesis or semantic embedding. The original document is routed to the 'original' relationship, and each chunk is routed as plain text to the 'chunks' relationship.

Properties

In the table below, required properties are marked with an asterisk (*); all other properties are optional. The table also lists any default values and notes whether a property supports the NiFi Expression Language.

Display Name	API Name	Default Value	Allowable Values	Description
Chunking Strategy *	Chunking Strategy	Section	Section, Paragraph	Specifies how the document should be chunked.
Subsection Strategy *	Subsection Strategy	Separate Subsections	Separate Subsections, Include Subsections	When a Section is found with one or more subsections, and the Section plus all subsections are small enough to fit within a single chunk, this property specifies how the subsections should be handled. This property is only considered if: the property Chunking Strategy has a value of Section.
Max Chunk Size *	Max Chunk Size	16000		The maximum number of characters that should be included in each chunk. Supports Expression Language, using FlowFile attributes and Environment variables. This property is only considered if: the property Chunking Strategy has a value of Section or Paragraph
Chunk Overlap *	Chunk Overlap	200		The number of characters to include from preceding and subsequent chunks. Note that if using a Chunking Strategy of 'Section', the Chunk Overlap will only take effect if multiple chunks are created for a given section due to the Max Chunk Size. Supports Expression Language, using FlowFile attributes and Environment variables.
Include Processing Elements *	Include Processing Elements	true	true, false	Specifies whether or not to include processing elements in the chunks.
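
To make the interaction between Max Chunk Size and Chunk Overlap concrete, here is a minimal Python sketch of character-based chunking with overlap. It is an illustration only and not the processor's implementation: this page does not state exactly how overlap is applied, so the sketch assumes a simple sliding window in which each chunk is at most max_chars long and shares overlap characters with its neighbor. The function name chunk_text and its parameters are hypothetical.

```python
def chunk_text(text: str, max_chars: int = 16000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Assumption (not taken from the processor): consecutive windows share
    `overlap` characters, and the overlap counts toward max_chars.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    step = max_chars - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Small values make the overlap easy to see.
print(chunk_text("abcdefghij", max_chars=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

With the defaults shown in the table (16000 and 200), a text shorter than 16000 characters yields a single chunk in this sketch, which parallels the note that overlap only matters under the Section strategy when a section has to be split.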

Dynamic Properties

This component does not support dynamic properties.

Relationships

Name	Description
chunks	The chunks of the document are routed to this Relationship upon successful chunking.
failure	If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship.
original	The original document is routed to this Relationship upon successful chunking.

Reads Attributes

This processor does not read attributes.

Writes Attributes

Name	Description
chunk.<key>	The metadata associated with the Document Container that was chunked.
container.id	The ID of the container that the chunk belongs to.
container.title	The title of the container that the chunk belongs to.
document.chunk.max.chars	The maximum number of characters that should be included in each chunk.
document.chunk.overlap	The number of characters from the previous chunk and subsequent chunk that are included in the given FlowFile's text.
document.chunk.processing.elements.included	Specifies whether or not Processing Elements were included in the chunks. One of 'true' or 'false'.
document.chunk.strategy	The strategy that was used to chunk the Document. One of 'Section' or 'Paragraph'.
document.chunk.subsections.included	Specifies whether or not subsections were allowed to be included in the chunks. One of 'true' or 'false'.
document.id	The ID of the Document that was chunked. This is useful for merging the chunks back together with the original Document. The ID that is used is the UUID of the incoming FlowFile.
fragment.count	The total number of chunks that were created from the document, plus 1. The extra 1 accounts for the Document itself, which is routed to the 'original' relationship. This makes the attribute convenient when merging the original Document back together with the results of processing the chunks.
fragment.index	The index of the chunk within the document.
mime.type	The MIME type of the chunk, which is set to text/plain.
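
The attributes above can be used downstream to tell when every chunk of a document has been seen. The following Python sketch shows one way to do that, assuming each chunk's attributes are available as a plain dictionary of strings. The function complete_documents is hypothetical and is not part of the processor; the sketch relies only on document.id, fragment.index, and fragment.count as described in the table, and makes no assumption about whether fragment.index starts at 0 or 1.

```python
from collections import defaultdict


def complete_documents(chunk_attrs):
    """Return the document.id values for which all chunks are present.

    chunk_attrs: iterable of dicts holding the attributes of FlowFiles
    from the 'chunks' relationship (hypothetical input shape).
    """
    by_doc = defaultdict(list)
    for attrs in chunk_attrs:
        by_doc[attrs["document.id"]].append(attrs)

    complete = []
    for doc_id, items in by_doc.items():
        # fragment.count is the number of chunks plus 1 (the original Document).
        expected_chunks = int(items[0]["fragment.count"]) - 1
        # Count distinct indexes so duplicate deliveries are not counted twice.
        seen = {a["fragment.index"] for a in items}
        if len(seen) == expected_chunks:
            complete.append(doc_id)
    return complete


# Sample attribute values below are made up for illustration.
sample = [
    {"document.id": "a", "fragment.index": "0", "fragment.count": "3"},
    {"document.id": "a", "fragment.index": "1", "fragment.count": "3"},
    {"document.id": "b", "fragment.index": "0", "fragment.count": "3"},
]
print(complete_documents(sample))
```

In the sample, document "a" has both of its two chunks while document "b" is still missing one, so only "a" is reported as complete.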

State Management

This component does not store state.

Restricted

This component is not restricted.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

This component does not specify system resource considerations.
