Advanced Document Parsing
Introduction
Generative AI, especially since the launch of ChatGPT in late 2022, has revolutionized how we interact with data. For enterprises, the true value of Generative AI lies in enabling users to interact with their data using natural language. This is typically achieved using a pattern called Retrieval Augmented Generation (RAG).
At Datavolo, we help you build multimodal data pipelines to support your AI initiatives. Often, data engineers deal with unstructured text from sources like PDF documents. These documents contain text, images, tables, and other elements that must be extracted and processed, presenting a challenge because the documents are designed for human readability, not machine readability.
Document Format
There is a vast array of document formats that contain rich information, from PDFs to Word documents to HTML. To avoid significant amounts of redundant work, we would like to process documents in each of these formats in much the same way. To accomplish this, we need a common way to represent the information in these documents. At Datavolo, we represent the information in these documents using JSON with a specific Document Schema. The schema allows us to represent the text that we extract, along with the structure of the document itself, and any additional information, including tables and images.
The JSON schema that we use to represent a document is provided below in the JSON Schema format. This schema allows us to represent both the data that was extracted from the original document and any additional information that we generate during processing.
JSON Schema
The Datavolo Document JSON Schema is as follows:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://datavolo.io/schemas/document.json",
  "title": "Document",
  "type": "object",
  "properties": {
    "container": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "boundingBox": {
          "type": "object",
          "properties": {
            "x": {
              "type": "number"
            },
            "y": {
              "type": "number"
            },
            "width": {
              "type": "number"
            },
            "height": {
              "type": "number"
            }
          }
        },
        "title": {
          "type": "string"
        },
        "metadata": {
          "type": "object"
        },
        "scope": {
          "type": "string",
          "enum": [
            "DOCUMENT",
            "SECTION",
            "NARRATIVE_TEXT",
            "LIST",
            "IMAGE",
            "TABLE",
            "PAGE_HEADER",
            "PAGE_FOOTER"
          ]
        }
      },
      "oneOf": [
        {
          "properties": {
            "textElement": {
              "type": "object",
              "properties": {
                "text": {
                  "type": "string"
                },
                "metadata": {
                  "type": "object"
                }
              }
            },
            "processingElement": {
              "type": "object",
              "properties": {
                "representations": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "data": {
                        "type": "string"
                      },
                      "metadata": {
                        "type": "object"
                      }
                    }
                  }
                },
                "metadata": {
                  "type": "object"
                }
              }
            }
          }
        },
        {
          "properties": {
            "containers": {
              "type": "array",
              "items": {
                "$ref": "#/properties/container"
              }
            }
          }
        }
      ],
      "required": [
        "id",
        "scope"
      ]
    }
  },
  "required": [
    "container"
  ]
}
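To make the schema concrete, here is a minimal sketch of a document instance that conforms to it. The identifiers, titles, and text are invented for illustration; real parser output would carry richer metadata.

```python
import json

# A hypothetical document instance following the Datavolo Document Schema.
# Only "id" and "scope" are required on each container; everything else
# here is illustrative.
document = {
    "container": {
        "id": "doc-1",
        "scope": "DOCUMENT",
        "containers": [
            {
                "id": "sec-1",
                "scope": "SECTION",
                "title": "Introduction",
                "metadata": {"start.page": 1, "end.page": 1},
                "containers": [
                    {
                        "id": "text-1",
                        "scope": "NARRATIVE_TEXT",
                        "textElement": {"text": "Generative AI has changed how we work."}
                    },
                    {
                        "id": "img-1",
                        "scope": "IMAGE",
                        "boundingBox": {"x": 72, "y": 340, "width": 300, "height": 200},
                        "processingElement": {
                            "representations": [
                                {"data": "A bar chart of quarterly revenue.",
                                 "metadata": {"generated.by": "PromptOpenAI"}}
                            ]
                        }
                    }
                ]
            }
        ]
    }
}

# Round-trip through JSON to confirm the structure is serializable.
serialized = json.dumps(document)
assert json.loads(serialized)["container"]["scope"] == "DOCUMENT"
```

Note that leaf containers carry a textElement or processingElement, while interior containers carry a containers array, matching the schema's oneOf.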
Schema Explanation
The root of the JSON object is a container object. Each container represents a specific element in the document. The container consists of the following properties:
- id: A unique identifier for the container.
- boundingBox: The position of the element on the document's page (optional).
- title: The title or heading of the section. This value is populated for each new section in a document and is not expected for other elements.
- metadata: Additional information about the document element. Common metadata keys are start.page, end.page, and category.confidence.
- scope: The type of information represented by the container.
A container can contain either an array of child containers, or a textElement and/or processingElement object. This structure ensures that each element can be updated atomically, preventing outdated information.
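Because each container holds either child containers or a leaf element, gathering all of a document's text is a simple recursion over the tree. The sketch below assumes a parsed document dict shaped like the schema above; the function name is ours for illustration, not a Datavolo processor:

```python
def collect_text(container, texts=None):
    """Recursively gather textElement text from a Document container tree."""
    if texts is None:
        texts = []
    text_element = container.get("textElement")
    if text_element and "text" in text_element:
        texts.append(text_element["text"])
    # Interior containers carry their children in a "containers" array.
    for child in container.get("containers", []):
        collect_text(child, texts)
    return texts

# Hypothetical two-level document: one section with two narrative paragraphs.
doc = {
    "container": {
        "id": "doc-1",
        "scope": "DOCUMENT",
        "containers": [
            {"id": "s1", "scope": "SECTION", "title": "Intro", "containers": [
                {"id": "t1", "scope": "NARRATIVE_TEXT",
                 "textElement": {"text": "First paragraph."}},
                {"id": "t2", "scope": "NARRATIVE_TEXT",
                 "textElement": {"text": "Second paragraph."}},
            ]}
        ]
    }
}

print(collect_text(doc["container"]))  # → ['First paragraph.', 'Second paragraph.']
```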
Processing Document Elements
Document processing generally consists of a few steps:
- Parsing: Convert documents into JSON using processors like ParsePdf.
- Processing: Handle extracted information, such as images or tables, using various processors. For example, PromptOpenAI might be used to generate a description of an image. ParseTableImage might be used to parse an image of a table and return the table as a CSV. ReplaceText might then be used to add a prefix and/or suffix to the generated text, or ConvertRecord might be used to convert the CSV into JSON.
- Reassembling: Merge processed elements back into the original Document JSON using MergeDocumentElements.
- Chunking: Split the document into chunks of text small enough for their semantic meaning to be encoded using an embedding model. The ChunkDocument processor chunks the data using structural information from the document; ChunkText can then be used to further chunk the data based on the semantic meaning of the text.
- Generating Embeddings: Create vector embeddings from document chunks using processors like CreateOpenAiEmbeddings.
- Storing Embeddings: Store the embeddings in a vector database. For example, UpsertPinecone is used to store embeddings in Pinecone.
By breaking down the processing pipeline into a series of simple steps, Datavolo allows you to quickly build powerful pipelines for unstructured data and tailor them to your individual needs.