PerformOCR
Description
Uses the Datavolo Tesseract OCR Service to extract text from a PDF or image, optionally providing metadata including the bounding box, page number and confidence level of the OCR.
Tags
datavolo, extract, image, jpeg, jpg, ocr, pdf, png, tesseract, text
Properties
In the table below, required properties are marked with an asterisk (*); all other properties are optional. The table also lists any default values and indicates whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Allowable Values | Description |
---|---|---|---|---|
OCR Service * | OCR Service | | Controller Service: OCRService. Implementations: StandardOCRService | An OCR Service for reading files to output text. |
MIME Type * | MIME Type | application/pdf | | The MIME Type of the input FlowFile. This is used to determine the format of the input data. Supports Expression Language, using FlowFile attributes and Environment variables. |
Extract PDF Text * | Extract PDF Text | true | | If true, the processor will attempt to extract text directly from the PDF files, rather than performing OCR. This can be more efficient and provide better results in many cases. In the case that text is not available in the PDF, OCR will be performed regardless of this setting. This property is only considered if: |
Record Writer | Record Writer | | Controller Service: RecordSetWriterFactory. Implementations: AvroRecordSetWriter, CSVRecordSetWriter, FreeFormTextRecordSetWriter, JsonRecordSetWriter, RecordSetWriterLookup, ScriptedRecordSetWriter, XMLRecordSetWriter | Specifies the Controller Service to use for writing the results. If not specified, the results will be written to the FlowFile as plaintext. If the Record Writer is specified, each text block will be output as an individual Record. In this case, the Record will contain not only the text that was found but also the bounding box in the image/pdf where the text was found, as well as the page number and the confidence level of the OCR. Each Record will have the following fields: text, x, y, height, width, pageNumber, and confidence. |
Confidence Threshold * | Confidence Threshold | 60 | | The minimum confidence level required for a text block to be included in the output. Text blocks with a confidence level below this value will be excluded. Supports Expression Language, using FlowFile attributes and Environment variables. |
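The record fields listed for the Record Writer property can be consumed downstream. The following is a minimal sketch, not part of the processor itself. It assumes a JsonRecordSetWriter configured to emit a JSON array, and it assumes confidence is a number on a 0 to 100 scale (inferred from the default threshold of 60). The sample values and the helper name text_by_page are hypothetical.

```python
import json

# Hypothetical sample of PerformOCR output, assuming a JsonRecordSetWriter
# that emits a JSON array. The field names come from the Record Writer
# description; every value below is invented for illustration.
SAMPLE_OUTPUT = """
[
  {"text": "Invoice", "x": 72, "y": 40, "height": 18, "width": 110, "pageNumber": 1, "confidence": 96.5},
  {"text": "T0tal", "x": 72, "y": 300, "height": 14, "width": 60, "pageNumber": 1, "confidence": 41.0},
  {"text": "Thank you", "x": 72, "y": 55, "height": 14, "width": 95, "pageNumber": 2, "confidence": 88.0}
]
"""


def text_by_page(records, min_confidence=60.0):
    """Group text blocks by page, dropping blocks below min_confidence."""
    pages = {}
    for record in records:
        # Mirror the documented rule: blocks below the threshold are excluded.
        if record["confidence"] < min_confidence:
            continue
        pages.setdefault(record["pageNumber"], []).append(record["text"])
    return pages


records = json.loads(SAMPLE_OUTPUT)
pages = text_by_page(records, min_confidence=60.0)
for page_number in sorted(pages):
    print(page_number, " ".join(pages[page_number]))
```

Because the processor already applies its own Confidence Threshold, a downstream filter like this is mainly useful when a stricter cutoff is wanted for a particular consumer.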
Dynamic Properties
This component does not support dynamic properties.
Relationships
Name | Description |
---|---|
comms.failure | If the processor is unable to communicate with the Tesseract OCR Service, the input FlowFile will be routed to this relationship. |
failure | If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship. |
success | The text extracted from the input PDF or image is routed to this relationship. |
Reads Attributes
This processor does not read attributes.
Writes Attributes
Name | Description |
---|---|
mime.type | The MIME Type of the FlowFile. |
text.extraction.method | The method used to extract the text from the FlowFile. This will be either 'PdfExtraction' or 'OCR'. |
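The value of text.extraction.method follows from the Extract PDF Text property description. The sketch below restates that rule as a small function. It is an illustration of the documented behavior for PDF input, not the processor's actual implementation; the function name and the pdf_has_text flag are hypothetical.

```python
def expected_extraction_method(extract_pdf_text: bool, pdf_has_text: bool) -> str:
    """Return the text.extraction.method value expected for a PDF input.

    extract_pdf_text: value of the Extract PDF Text property.
    pdf_has_text: hypothetical flag, True when the PDF carries embedded text.
    """
    # Direct extraction is attempted only when the property is true and the
    # PDF actually contains text; otherwise OCR is performed.
    if extract_pdf_text and pdf_has_text:
        return "PdfExtraction"
    return "OCR"


for flag in (True, False):
    for has_text in (True, False):
        print(flag, has_text, expected_extraction_method(flag, has_text))
```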
State Management
This component does not store state.
Restricted
This component is not restricted.
Input Requirement
This component requires an incoming relationship.
System Resource Considerations
This component does not specify system resource considerations.