PerformOCR

Description

Uses the Datavolo Tesseract OCR Service to extract text from a PDF or image, optionally providing metadata including the bounding box, page number and confidence level of the OCR.

Tags

datavolo, extract, image, jpeg, jpg, ocr, pdf, png, tesseract, text

Properties

In the list below, required properties are shown with an asterisk (*). Other properties are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

OCR Service *
  API Name: OCR Service
  Controller Service: OCRService
  Implementations: StandardOCRService
  Description: An OCR Service for reading files to output text.

MIME Type *
  API Name: MIME Type
  Default Value: application/pdf
  Description: The MIME Type of the input FlowFile. This is used to determine the format of the input data.
  Supports Expression Language, using FlowFile attributes and Environment variables.

Extract PDF Text *
  API Name: Extract PDF Text
  Default Value: true
  Allowable Values:
    • true
    • false
  Description: If true, the processor attempts to extract text directly from the PDF file rather than performing OCR. This can be more efficient and provide better results in many cases. If no text is available in the PDF, OCR is performed regardless of this setting.
  This property is only considered if the MIME Type property has a value of application/pdf.

Record Writer
  API Name: Record Writer
  Controller Service: RecordSetWriterFactory
  Implementations: AvroRecordSetWriter, CSVRecordSetWriter, FreeFormTextRecordSetWriter, JsonRecordSetWriter, RecordSetWriterLookup, ScriptedRecordSetWriter, XMLRecordSetWriter
  Description: Specifies the Controller Service to use for writing the results. If not specified, the results are written to the FlowFile as plaintext. If a Record Writer is specified, each text block is output as an individual Record. In this case, the Record contains not only the text that was found but also the bounding box in the image or PDF where the text was found, as well as the page number and the confidence level of the OCR. Each Record has the following fields: text, x, y, height, width, pageNumber, and confidence.

Confidence Threshold *
  API Name: Confidence Threshold
  Default Value: 60
  Description: The minimum confidence level required for a text block to be included in the output. Text blocks with a confidence level below this value are excluded.
  Supports Expression Language, using FlowFile attributes and Environment variables.
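
To illustrate how the Record Writer fields and the Confidence Threshold relate, here is a minimal Python sketch that models a text-block record and drops blocks below a threshold. It is not the processor's implementation. The field names and the default of 60 come from the property descriptions; the sample values, the numeric types, the 0 to 100 confidence scale, and the names TextBlock and filter_blocks are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One record, using the field names listed for the Record Writer output."""
    text: str
    x: float
    y: float
    height: float
    width: float
    pageNumber: int
    confidence: float  # assumed to be on a 0-100 scale


def filter_blocks(blocks, threshold=60):
    """Keep blocks whose confidence is at least the threshold.

    Blocks strictly below the threshold are excluded, matching the
    wording of the Confidence Threshold description.
    """
    return [b for b in blocks if b.confidence >= threshold]


# Sample values are invented for illustration only.
blocks = [
    TextBlock("Invoice 1234", 72, 90, 14, 180, 1, 93.5),
    TextBlock("~~;;", 300, 410, 9, 40, 1, 21.0),
    TextBlock("Total due", 72, 640, 14, 120, 2, 60.0),
]

kept = filter_blocks(blocks)
print([b.text for b in kept])
```

With the default threshold, the low-confidence noise block is dropped, while a block at exactly 60 is kept because only values below the threshold are excluded.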

Dynamic Properties

This component does not support dynamic properties.

Relationships

Name | Description
comms.failure | If the processor is unable to communicate with the Tesseract OCR Service, the input FlowFile will be routed to this relationship.
failure | If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship.
success | The text extracted from the PDF or image is routed to this relationship.
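
The three relationships above can be read as a simple routing rule. The following Python sketch models that rule only. The exception class OcrServiceUnavailable, the function route, and the extract callback are hypothetical names introduced for this example, and the processor's actual error handling may differ.

```python
class OcrServiceUnavailable(Exception):
    """Hypothetical error: the OCR service could not be reached."""


def route(input_content, extract):
    """Return (relationship, content) for one input.

    extract is a hypothetical callable that returns the extracted text
    or raises an exception.
    """
    try:
        # Extraction worked: the extracted text goes to success.
        return "success", extract(input_content)
    except OcrServiceUnavailable:
        # Communication problem: the input is routed to comms.failure.
        return "comms.failure", input_content
    except Exception:
        # Any other extraction problem: the input is routed to failure.
        return "failure", input_content
```

Note that in both failure cases the original input is passed along unchanged, mirroring the statement that the input FlowFile is routed to the relationship.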

Reads Attributes

This processor does not read attributes.

Writes Attributes

Name | Description
mime.type | The MIME Type of the FlowFile.
text.extraction.method | The method used to extract the text from the FlowFile. This will be either 'PdfExtraction' or 'OCR'.
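
As a rough model of how the text.extraction.method value follows from the MIME Type and Extract PDF Text properties, consider the Python sketch below. It is an interpretation of the property descriptions, not the processor's code. The function name extraction_method and the pdf_has_text flag (whether the PDF contains directly extractable text) are assumptions made for this example.

```python
def extraction_method(mime_type: str, extract_pdf_text: bool, pdf_has_text: bool) -> str:
    """Return the value expected in the text.extraction.method attribute."""
    # Extract PDF Text is only considered when MIME Type is application/pdf,
    # and direct extraction only works if the PDF actually contains text.
    if mime_type == "application/pdf" and extract_pdf_text and pdf_has_text:
        return "PdfExtraction"
    # Images, PDFs without extractable text, and Extract PDF Text = false
    # all result in OCR.
    return "OCR"


print(extraction_method("application/pdf", True, True))
print(extraction_method("application/pdf", True, False))
print(extraction_method("image/png", True, False))
```

A downstream flow could branch on this attribute, for example to handle OCR results separately from directly extracted text.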

State Management

This component does not store state.

Restricted

This component is not restricted.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

This component does not specify system resource considerations.

See Also

ConvertPdfToImage