Welcome to Datavolo

Datavolo is a tool for data engineers supporting AI teams. Datavolo provides a framework, feature set, and catalog of repeatable patterns for building multimodal data pipelines that are secure, simple, and scalable.

What’s the problem with existing pipelines for unstructured data?

ELT is a less natural pattern in the multimodal setting because the systems that produce multimodal data are not relational. This shifts the emphasis back to transformation, and to the classic ETL pattern, within multimodal data frameworks that must bridge these non-relational producing systems with the emerging consuming systems.

Many data pipeline solutions are based on row-oriented abstractions and are built for data with established structures and schemas. In the multimodal data world, data is often large and not structured as rows, and traditional data platforms rely on point-to-point ELT architectures that don't work well for the target systems relevant to LLM applications.

For example, once chunks of text are transformed into embeddings and indexed in a vector store or search index, it is no longer possible to transform or enrich the data further, as it is with SQL in a structured data warehouse setting.
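To make the contrast concrete, here is a minimal, self-contained Python sketch of chunking text and indexing the resulting vectors. This is not Datavolo code: `embed` is a hypothetical, hash-based stand-in for a real embedding model, and the dict-based "store" stands in for a vector database.

```python
import hashlib


def embed(chunk: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model call (deterministic hash)."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:dim]]


def chunk_text(text: str, size: int = 40) -> list[str]:
    """Naive fixed-size chunking; real pipelines split on semantic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


document = "Datavolo builds multimodal data pipelines for AI systems. " * 3
index = {chunk: embed(chunk) for chunk in chunk_text(document)}

# Once vectors sit in the index, the original transformation context is gone:
# any enrichment has to happen upstream in the pipeline, not in the store.
for chunk, vector in list(index.items())[:2]:
    print(repr(chunk[:20]), "->", [round(v, 2) for v in vector[:3]])
```

The point of the sketch is the one-way step in the middle: unlike rows in a warehouse, the indexed vectors cannot be further transformed or joined in place.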

How does Datavolo’s software solve those problems?

Built from the ground up to support the ETL pattern, Datavolo leverages a robust set of out-of-the-box processors to extract, clean, transform, enrich, and publish both unstructured and structured data. Datavolo is designed for continuous, event-driven ingest that can scale up on demand to cope with bursts of high-volume data. Our platform can handle a variety of data, including audio and video streams, images, raw signals captured by sensors, deeply nested hierarchical JSON or XML, text-based log entries, and highly structured databases of rows and records. The industry is witnessing rapid advancement, and we know that flexibility will be critical for data engineers as the stack continues to evolve and open questions are answered. That's why Datavolo's data pipelines and orchestration capabilities are purpose-built to make it easy to swap APIs, sources, targets, and models.

How can existing generative AI models be improved by enabling them to tap into all of your data?

For enterprises to get the most value out of integrating with foundation language models, they also need to include their own business data, since these models were not pre-trained on it. We strongly believe that successful AI apps will be built on AI systems, not directly on top of AI models, and that useful AI systems must be able to retrieve contextual data from enterprise data systems to supplement the generative capabilities of LLMs and drive business value.

Datavolo's core value proposition:

  • Datavolo provides a visual, low-code experience that is easy to use and ships with hundreds of out-of-the-box integrations to the AI ecosystem: sources, targets, embedding models, LLMs, vector databases, and more. This drives development velocity without sacrificing solution quality, and promotes reuse and modularity of solutions, avoiding wasted effort across the enterprise.
  • Datavolo supports continuous and scalable event-based ingestion, including critical data engineering capabilities such as error handling, observability, scheduling, data governance & security.
  • Datavolo provides the flexibility to easily swap APIs and change transformations, sources, destinations, & models, with extensibility through custom Python processors.
  • Datavolo automatically captures data provenance and lineage out-of-the-box for all dataflows
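The extensibility point above can be illustrated conceptually. The sketch below is not the Datavolo or Apache NiFi processor API; it models a processor as a plain Python function over flowfile-like dicts (content plus attributes), to show the shape of a custom transform step in a flow.

```python
from typing import Callable

# Conceptual flowfile: binary content plus string attributes, as in NiFi-style flows.
FlowFile = dict
Processor = Callable[[FlowFile], FlowFile]


def uppercase_processor(flowfile: FlowFile) -> FlowFile:
    """A hypothetical custom transform: uppercase the content, tag the attributes."""
    return {
        "content": flowfile["content"].upper(),
        "attributes": {**flowfile["attributes"], "transformed": "true"},
    }


def run_pipeline(flowfile: FlowFile, processors: list[Processor]) -> FlowFile:
    """Chain processors, as a flow routes flowfiles from one step to the next."""
    for process in processors:
        flowfile = process(flowfile)
    return flowfile


result = run_pipeline(
    {"content": b"hello datavolo", "attributes": {"source": "demo"}},
    [uppercase_processor],
)
print(result["content"])  # b'HELLO DATAVOLO'
```

Because each step is just a function over the same flowfile shape, swapping a source, target, or transform means swapping one entry in the processor list, which is the flexibility the bullets above describe.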

Datavolo is based on the open-source Apache NiFi project, so what does it bring to the table that’s new?

Datavolo is poised to bring several new things to the table, including a managed cloud-native architecture (Datavolo Cloud) and container images (Datavolo Server) that will make it easier to operate, scale, and secure data flows. Datavolo will also include a new set of AI-specific capabilities for building LLM apps, including new processors and Datavolo-specific APIs to support advanced RAG patterns, along with integrations with embedding models, vector databases, and foundation LLMs. Datavolo will also imbue the data engineering experience with AI, including capabilities to create, understand, and manipulate scripted transforms, as well as entire data flows, using natural language.
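As a rough illustration of the retrieval step in a RAG pattern (a sketch, not Datavolo's implementation: the vectors here are hand-made toy embeddings, where a real system would produce them with an embedding model), nearest-neighbor lookup over stored vectors selects the context that supplements the LLM prompt:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy vector store: text chunks with hand-made 3-dimensional "embeddings".
store = {
    "invoices are due in 30 days": [0.9, 0.1, 0.0],
    "the API rate limit is 100 rps": [0.1, 0.9, 0.0],
    "refunds require manager approval": [0.8, 0.2, 0.1],
}


def retrieve(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query embedding."""
    ranked = sorted(store, key=lambda text: cosine(store[text], query_vec),
                    reverse=True)
    return ranked[:k]


# A query embedding close to the billing-related chunks.
context = retrieve([0.85, 0.15, 0.05])
print(context)
```

The retrieved chunks would then be prepended to the prompt sent to the foundation model, which is the sense in which AI systems supplement generative models with enterprise data.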

Let's dive in!