Datavolo Processors
The processors listed below are exclusive to Datavolo Runtimes.
CaptureGoogleDriveChanges
Captures changes to a Shared Google Drive and emits a FlowFile for each change that occurs
CaptureSharepointChanges
Captures changes from a SharePoint Document Library and emits a FlowFile for each change that occurs
ChunkDocument
Given an input Datavolo Document, chunks the data into segments that are more applicable for LLM synthesis or semantic embedding
ChunkText
Chunks text with options for recursively splitting by delimiters and max character length
ConsumeKafka
Consumes messages from Apache Kafka using the Kafka Consumer API
ConvertOfficeFormat
Converts an OpenOffice-compatible file to PDF or DOCX format
ConvertPdfToImage
Converts a PDF file into a series of images, one for each page.
CreateAzureOpenAiEmbeddings
Uses Azure OpenAI to create embeddings for text
CreateCohereEmbeddings
Uses Cohere to create embeddings for text
CreateOllamaEmbeddings
Uses Ollama to create embeddings for text
CreateOpenAiEmbeddings
Uses OpenAI to create embeddings for text
CreateSnowflakeEmbeddings
Create vector embeddings using Snowflake Cortex Large Language Model functions
CreateVertexAIEmbeddings
Uses VertexAI to create embeddings for text
DeleteDBFSResource
Delete DBFS files and directories.
DeleteMilvus
Deletes vectors by ID from a collection in a Milvus database
DeletePinecone
Deletes vectors from a Pinecone index.
DeleteUnityCatalogResource
Delete a Unity Catalog file or directory.
DetectDocumentPII
Accepts a parsed document and returns it with metadata containing the text positions of recognized PII entities
EnrichAttributes
Looks up a value using the configured Lookup Service and adds the results to the FlowFile as one or more attributes
EvaluateRagAnswerCorrectness
Evaluates the correctness of generated answers in a Retrieval-Augmented Generation (RAG) context by computing metrics such as F1 score, cosine similarity, and answer correctness
EvaluateRagFaithfulness
Evaluates the faithfulness of generated answers in a Retrieval-Augmented Generation (RAG) system by analyzing responses using an LLM (e.g., OpenAI's GPT)
EvaluateRagRetrieval
Calculates retrieval metrics (Precision@N, Recall@N, FScore@N, MAP@N, MRR) for a RAG system using an LLM as a judge
ExecuteSQLStatement
Executes a SQL DDL or DML Statement against a database
ExtractDocumentRawText
Extracts the text from a Document and writes it to the FlowFile content
FetchSharepointFile
Fetches the contents of a file from a SharePoint Drive, optionally downloading a PDF or HTML version of the file when applicable
FormatWordDocument
Formats an MS Word DOCX file
GenerateAnswersFromContext
Generates synthetic answers for each question present in the incoming records using a Large Language Model (LLM)
GenerateAnswersFromGroundTruth
Generates synthetic answers for each question in the incoming records using an LLM
GetDBFSFile
Read a DBFS file.
GetHubSpotObject
Get a HubSpot object and its associations by ID or unique value.
GetUnityCatalogFile
Read a Unity Catalog file up to 5 GiB.
GetUnityCatalogFileMetadata
Checks for Unity Catalog file metadata.
ListDBFSDirectory
List file names in a DBFS directory and output a new FlowFile with the filename.
ListUnityCatalogDirectory
List file names in a Unity Catalog directory and output a new FlowFile with the filename.
MergeDocumentElements
Given a FlowFile that contains a full Document and one or more FlowFiles that contain additional data to merge into the Document, this Processor will merge the additional data into the Document
OpenAiTranscribeAudio
Transcribes audio into English text
ParsePdfDocument
Parses a PDF file, extracting the text and additional information into a structured JSON document
ParseTableImage
Extracts the text from a table image and writes it to the FlowFile content in CSV format.
PerformOCR
Uses the Datavolo Tesseract OCR Service to extract text from a PDF or image, optionally providing metadata including the bounding box, page number and confidence level of the OCR.
PromptAnthropicAI
Sends a prompt to Anthropic, writing the response either as a FlowFile attribute or to the contents of the incoming FlowFile
PromptAzureOpenAI
Sends a prompt to Azure's OpenAI service, writing the response either as a FlowFile attribute or to the contents of the incoming FlowFile
PromptLLM
Sends a user-defined prompt to a Large Language Model (LLM) for a response.
PromptOllama
Sends a prompt to Ollama, writing the response either as a FlowFile attribute or to the contents of the incoming FlowFile
PromptOpenAI
Sends a prompt to OpenAI, writing the response either as a FlowFile attribute or to the contents of the incoming FlowFile
PromptSnowflakeCortex
Sends a prompt to Snowflake Cortex, writing the response either as a FlowFile attribute or to the contents of the incoming FlowFile
PromptVertexAI
Sends a prompt to VertexAI, writing the response either as a FlowFile attribute or to the contents of the incoming FlowFile
PublishKafka
Sends the contents of a FlowFile as either a message or as individual records to Apache Kafka using the Kafka Producer API
PutDatabricksSQL
Submit a SQL execution using the Databricks REST API, then write the JSON response to the FlowFile content
PutDBFSFile
Write FlowFile content to DBFS.
PutHubSpot
Upsert a HubSpot object.
PutIcebergTable
Store records in Iceberg using a configurable Catalog for managing namespaces and tables.
PutMLflow
Record metadata in MLflow
PutSnowflakeInternalStageFile
Puts files into a Snowflake internal stage
PutUnityCatalogFile
Write FlowFile content, up to a maximum size of 5 GiB, to Unity Catalog.
PutVectaraDocument
Generate and upload a JSON document to Vectara's upload endpoint
PutVectaraFile
Upload FlowFile content to Vectara's index endpoint
PutVespaDocument
Uses the Vespa Document API to update a record in a specific namespace.
QueryDocument
Evaluates a SQL-like query against the incoming Datavolo Document JSON, producing the results on the outgoing FlowFile
QueryMilvus
Queries a given collection in a Milvus database using vectors
QueryPinecone
Queries Pinecone for vectors that are similar to the input vector, or retrieves a vector by ID.
RunDatabricksJob
Triggers a pre-defined Databricks job to run with custom parameters
SummarizeText
This processor uses a Large Language Model (LLM) to summarize the content of a FlowFile
UpsertMilvus
Upserts vectors into a given collection in a Milvus database
UpsertPinecone
Publishes vectors, including metadata, and optionally text, to a Pinecone index.
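
Example: retrieval metrics
The metrics named under EvaluateRagRetrieval (Precision@N, Recall@N, FScore@N, MAP@N, MRR) are standard information-retrieval measures. The Python sketch below shows one common way to compute them for a single query. It is an illustration only, not the processor's implementation: the function names are hypothetical, the set of relevant document IDs is passed in directly rather than produced by an LLM judge, and the denominators chosen for Precision@N and average precision are one convention among several.

```python
from typing import Sequence, Set


def precision_at_n(retrieved: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n slots filled by relevant items (denominator is n)."""
    if n <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:n] if doc in relevant)
    return hits / n


def recall_at_n(retrieved: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of all relevant items that appear in the top n."""
    if not relevant or n <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:n] if doc in relevant)
    return hits / len(relevant)


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_n(retrieved: Sequence[str], relevant: Set[str], n: int) -> float:
    """Average of precision@k over the ranks k <= n that hold a relevant item.
    Assumed convention: divide by min(len(relevant), n).
    MAP@N is the mean of this value over all queries."""
    if not relevant or n <= 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc in enumerate(retrieved[:n], start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / min(len(relevant), n)


if __name__ == "__main__":
    # Hypothetical ranked results and relevance judgments for one query.
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = {"d2", "d4", "d9"}
    p = precision_at_n(retrieved, relevant, 4)
    r = recall_at_n(retrieved, relevant, 4)
    print(p, r, f_score(p, r), reciprocal_rank(retrieved, relevant))
```

In a full evaluation these per-query values would be averaged across every question in the test set; which documents count as relevant is the part that the processor description assigns to an LLM judge.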