Advanced Document Parsing
Introduction
Generative AI, especially since the launch of ChatGPT in late 2022, has revolutionized how we interact with data. For enterprises, the true value of Generative AI lies in enabling users to interact with their data using natural language. This is typically achieved using a pattern called Retrieval Augmented Generation (RAG).
At Datavolo, we help you build multimodal data pipelines to support your AI initiatives. Often, data engineers deal with unstructured text from sources like PDF documents. These documents contain text, images, tables, and other elements that must be extracted and processed, presenting a challenge because the documents are designed for human readability, not machine readability.
Document Format
There is a vast array of document formats that contain rich information, from PDFs to Word documents to HTML. To avoid significant amounts of redundant work, we would like to process documents in each of these formats in much the same way. To accomplish this, we need a common way to represent the information in these documents. At Datavolo, we represent the information in these documents using JSON with a specific Document Schema. The schema allows us to represent the text that we extract, along with the structure of the document itself, and any additional information, including tables and images.
The JSON schema that we use to represent a document is provided below in the JSON Schema format. This schema allows us to represent both the data that was extracted from the original document and any additional information that we generate during processing.
JSON Schema
The Datavolo Document JSON Schema is as follows:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://datavolo.io/schemas/document.json",
  "title": "Document",
  "type": "object",
  "properties": {
    "container": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "boundingBox": {
          "type": "object",
          "properties": {
            "x": {
              "type": "number"
            },
            "y": {
              "type": "number"
            },
            "width": {
              "type": "number"
            },
            "height": {
              "type": "number"
            }
          }
        },
        "title": {
          "type": "string"
        },
        "metadata": {
          "type": "object"
        },
        "scope": {
          "type": "string",
          "enum": [
            "DOCUMENT",
            "SECTION",
            "NARRATIVE_TEXT",
            "LIST",
            "IMAGE",
            "TABLE",
            "PAGE_HEADER",
            "PAGE_FOOTER"
          ]
        }
      },
      "oneOf": [
        {
          "properties": {
            "textElement": {
              "type": "object",
              "properties": {
                "text": {
                  "type": "string"
                },
                "metadata": {
                  "type": "object"
                }
              }
            },
            "processingElement": {
              "type": "object",
              "properties": {
                "representations": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "data": {
                        "type": "string"
                      },
                      "metadata": {
                        "type": "object"
                      }
                    }
                  }
                },
                "metadata": {
                  "type": "object"
                }
              }
            }
          }
        },
        {
          "properties": {
            "containers": {
              "type": "array",
              "items": {
                "$ref": "#/properties/container"
              }
            }
          }
        }
      ],
      "required": [
        "id",
        "scope"
      ]
    }
  },
  "required": [
    "container"
  ]
}
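To make the schema concrete, here is a minimal sketch of a document instance that conforms to it. The identifiers, titles, and text are invented for illustration; real parser output would carry richer metadata.

```python
import json

# A hypothetical document instance following the Datavolo Document Schema.
# Only "id" and "scope" are required on each container; everything else
# here is illustrative.
document = {
    "container": {
        "id": "doc-1",
        "scope": "DOCUMENT",
        "containers": [
            {
                "id": "sec-1",
                "scope": "SECTION",
                "title": "Introduction",
                "metadata": {"start.page": 1, "end.page": 1},
                "containers": [
                    {
                        "id": "text-1",
                        "scope": "NARRATIVE_TEXT",
                        "textElement": {"text": "Generative AI has changed how we work."}
                    },
                    {
                        "id": "img-1",
                        "scope": "IMAGE",
                        "boundingBox": {"x": 72, "y": 340, "width": 300, "height": 200},
                        "processingElement": {
                            "representations": [
                                {"data": "A bar chart of quarterly revenue.",
                                 "metadata": {"generated.by": "PromptOpenAI"}}
                            ]
                        }
                    }
                ]
            }
        ]
    }
}

# Round-trip through JSON to confirm the structure is serializable.
serialized = json.dumps(document)
assert json.loads(serialized)["container"]["scope"] == "DOCUMENT"
```

Note that leaf containers carry a textElement or processingElement, while interior containers carry a containers array, matching the schema's oneOf.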
Schema Explanation
The root of the JSON object is a container object. Each container represents a specific element in the document. The container consists of the following properties:
- id: A unique identifier for the container.
- boundingBox: The position of the element on the document's page (optional).
- title: The title or heading of the section. This value is populated for each new section in a document and is not expected for other elements.
- metadata: Additional information about the document element. Common metadata keys are start.page, end.page, and category.confidence.
- scope: The type of information represented by the container.
A container can contain either an array of child containers, or a textElement and/or processingElement object. This structure ensures that each element can be updated atomically, preventing outdated information.
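Because each container holds either child containers or a leaf element, gathering all of a document's text is a simple recursion over the tree. The sketch below assumes a parsed document dict shaped like the schema above; the function name is ours for illustration, not a Datavolo processor:

```python
def collect_text(container, texts=None):
    """Recursively gather textElement text from a Document container tree."""
    if texts is None:
        texts = []
    text_element = container.get("textElement")
    if text_element and "text" in text_element:
        texts.append(text_element["text"])
    # Interior containers carry their children in a "containers" array.
    for child in container.get("containers", []):
        collect_text(child, texts)
    return texts

# Hypothetical two-level document: one section with two narrative paragraphs.
doc = {
    "container": {
        "id": "doc-1",
        "scope": "DOCUMENT",
        "containers": [
            {"id": "s1", "scope": "SECTION", "title": "Intro", "containers": [
                {"id": "t1", "scope": "NARRATIVE_TEXT",
                 "textElement": {"text": "First paragraph."}},
                {"id": "t2", "scope": "NARRATIVE_TEXT",
                 "textElement": {"text": "Second paragraph."}},
            ]}
        ]
    }
}

print(collect_text(doc["container"]))  # → ['First paragraph.', 'Second paragraph.']
```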
Processing Document Elements
Document processing generally consists of a few steps:
- Parsing: Convert documents into JSON using processors like ParsePdf.
- Processing: Handle extracted information, such as images or tables, using various processors. For example, PromptOpenAI might be used to generate a description of an image. ParseTableImage might be used to parse an image of a table and return the table as a CSV. ReplaceText might then be used to add a prefix and/or suffix to the generated text, or ConvertRecord might be used to convert the CSV into JSON.
- Reassembling: Merge processed elements back into the original Document JSON using MergeDocumentElements.
- Chunking: Split the document into chunks of text small enough for their semantic meaning to be encoded using an embedding model. The ChunkDocument processor chunks the data using structural information from the document; ChunkText can then be used to further chunk the data based on the semantic meaning of the text.
- Generating Embeddings: Create vector embeddings from document chunks using processors like CreateOpenAiEmbeddings.
- Storing Embeddings: Store the embeddings in a vector database. For example, UpsertPinecone is used to store embeddings in Pinecone.
By breaking down the processing pipeline into a series of simple steps, Datavolo allows you to quickly build powerful pipelines for unstructured data and tailor them to your individual needs.