PerformOCR

Description

Uses the Datavolo Tesseract OCR Service to extract text from a PDF or image, optionally providing metadata including the bounding box, page number and confidence level of the OCR.

Properties

In the table below, required properties are marked with an asterisk (*); all other properties are optional. The table also lists default values and notes whether a property supports the NiFi Expression Language.

Display Name	API Name	Default Value	Allowable Values	Description
OCR Service *	OCR Service		Controller Service: OCRService Implementations: StandardOCRService	An OCR Service for reading files to output text.
MIME Type *	MIME Type	application/pdf		The MIME Type of the input FlowFile. This is used to determine the format of the input data. Supports Expression Language, using FlowFile attributes and Environment variables.
Extract PDF Text *	Extract PDF Text	true	true false	If true, the processor will attempt to extract text directly from the PDF file rather than performing OCR. This can be more efficient and provide better results in many cases. If no text is available in the PDF, OCR is performed regardless of this setting. This property is only considered if the MIME Type property has a value of application/pdf.
Record Writer	Record Writer		Controller Service: RecordSetWriterFactory Implementations: AvroRecordSetWriter CSVRecordSetWriter FreeFormTextRecordSetWriter JsonRecordSetWriter RecordSetWriterLookup ScriptedRecordSetWriter XMLRecordSetWriter	Specifies the Controller Service to use for writing the results. If not specified, the results will be written to the FlowFile as plaintext. If the Record Writer is specified, each text block will be output as an individual Record. In this case, the Record will contain not only the text that was found but also the bounding box in the image or PDF where the text was found, as well as the page number and the confidence level of the OCR. Each Record will have the following fields: `text`, `x`, `y`, `height`, `width`, `pageNumber`, and `confidence`.
Confidence Threshold *	Confidence Threshold	60		The minimum confidence level required for a text block to be included in the output. Text blocks with a confidence level below this value will be excluded. Supports Expression Language, using FlowFile attributes and Environment variables.
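
When a Record Writer is configured, downstream consumers receive one Record per text block with the fields listed above. The following is a minimal sketch of how such output might be consumed, assuming a JsonRecordSetWriter that emits a JSON array. The sample values, the `text_by_page` helper, the 0-100 confidence scale, and the assumption that `y` increases toward the bottom of the page are illustrative and are not taken from the processor's documentation.

```python
import json
from collections import defaultdict

# Hypothetical sample of JsonRecordSetWriter output; all values are invented.
SAMPLE_OUTPUT = """
[
  {"text": "Thank you", "x": 72, "y": 60, "height": 14, "width": 95, "pageNumber": 2, "confidence": 71},
  {"text": "Total: 41.50", "x": 72, "y": 700, "height": 14, "width": 130, "pageNumber": 1, "confidence": 88},
  {"text": "Invoice", "x": 72, "y": 40, "height": 18, "width": 110, "pageNumber": 1, "confidence": 96}
]
"""


def text_by_page(raw_json, min_confidence=0):
    """Group text blocks by page, ordered top-to-bottom then left-to-right.

    Blocks whose confidence is below min_confidence are dropped, mirroring
    the way the Confidence Threshold property is described.
    """
    pages = defaultdict(list)
    for record in json.loads(raw_json):
        if record["confidence"] >= min_confidence:
            pages[record["pageNumber"]].append(record)
    return {
        page: [r["text"] for r in sorted(blocks, key=lambda r: (r["y"], r["x"]))]
        for page, blocks in sorted(pages.items())
    }


if __name__ == "__main__":
    for page, lines in text_by_page(SAMPLE_OUTPUT, min_confidence=80).items():
        print(page, lines)
```

A stricter cut-off applied downstream like this can be useful when the processor's own Confidence Threshold is kept low to preserve as much text as possible for other consumers.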

Dynamic Properties

This component does not support dynamic properties.

Relationships

Name	Description
comms.failure	If the processor is unable to communicate with the Tesseract OCR Service, the input FlowFile will be routed to this relationship.
failure	If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship.
success	The extracted text is routed to this relationship.

Reads Attributes

This processor does not read attributes.

Writes Attributes

Name	Description
mime.type	The MIME Type of the FlowFile.
text.extraction.method	The method used to extract the text from the FlowFile. This will be either 'PdfExtraction' or 'OCR'.
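
The value of `text.extraction.method` follows from the MIME Type and Extract PDF Text properties described earlier. The sketch below restates that documented behaviour as a small function. It is an illustration only, not the processor's implementation; the function name and the `pdf_has_text` parameter are hypothetical.

```python
def expected_extraction_method(mime_type, extract_pdf_text, pdf_has_text):
    """Return the text.extraction.method value the documented rules imply.

    pdf_has_text is a hypothetical flag standing in for "text is available
    in the PDF"; the processor determines this internally.
    """
    # Direct extraction is only attempted for PDFs with Extract PDF Text enabled,
    # and only succeeds when the PDF actually contains text.
    if mime_type == "application/pdf" and extract_pdf_text and pdf_has_text:
        return "PdfExtraction"
    # Images, scanned PDFs without text, and PDFs with the option disabled use OCR.
    return "OCR"


if __name__ == "__main__":
    print(expected_extraction_method("application/pdf", True, True))
    print(expected_extraction_method("application/pdf", True, False))
    print(expected_extraction_method("image/png", True, False))
```

A flow could branch on this attribute, for example to apply extra validation only to FlowFiles whose text came from OCR.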

State Management

This component does not store state.

Restricted

This component is not restricted.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

This component does not specify system resource considerations.
