Towards a grounded extraction of semantic information from private data

This proposal outlines the development of a knowledge acquisition system for Vendia, inspired by Microsoft's GraphRAG, which extracts semantic information from private client data to construct knowledge graphs. These graphs provide deep, contextually relevant insights for data categorization and analysis to enhance AI applications such as event-driven pipelines, natural language analytics, and conversational task completion.


Knowledge graphs and large language models (LLMs) have become powerful tools for extracting and generating information from text. The recent success of Retrieval-Augmented Generation (RAG) agents has shown that LLMs can generate useful answers by querying an external knowledge source such as a knowledge graph. This approach has been used to build conversational agents, answer questions, and even create clinical decision support systems.

However, one of the most pressing challenges in modern AI systems is that LLMs cannot be relied on to produce accurate information. Even in traditional ML systems, the classification and clustering of data is often brittle and error-prone. Historically, knowledge-based systems have been used to ground the information that an AI system produces, but their knowledge bases are often incomplete and require manual curation. Much of the work in this area has focused on acquiring new knowledge either through logical reasoning or through statistical inference.

Microsoft's Jonathan Larson presents GraphRAG.

Recent work has achieved success by using LLMs to generate new knowledge, showing that the weighted relationships extracted among entities are far richer than a simple co-occurrence matrix. However, the challenge remains in finding semantic aggregations, hierarchies, and deep relationships among entities. Microsoft has been working on GraphRAG, a RAG agent that leverages knowledge graphs extracted from private data. They define private data as "data that an LLM is not trained on and has never seen before, such as an enterprise’s proprietary research, business documents, or communications."

While RAG is an incredibly useful tool, we often want our systems to behave like systems, not people. In Vendia's product landscape, there's room for both: deep semantic representations of client data enable rich conversational interactions and agent behavior.

Knowledge Graph Acquisition

In a recent discussion with Eleni about Vendia, she mentioned the company's potential interest in the automatic categorization of client data. One significant challenge in using LLMs and RAG agents with private data lies in the data retention policies of LLM providers and in whether client data may be used for online model training.

A key advantage of a GraphRAG-like knowledge acquisition system is its ability to extract a knowledge graph, including deep semantic relationships among nodes, directly from the client's data. This approach trades the traditional ML train/eval/validate loop for a higher runtime and context length cost, but has the advantage of extracting far richer semantic relations. Since the knowledge graph is derived from and directly references the client's data, its semantics uniquely reflect the context of the client's domain.
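To make the idea of a graph that "directly references the client's data" more concrete, here is a minimal sketch in Python. It is not GraphRAG's implementation. The class names, the relation labels, and the `extracted` triples (standing in for the output of an LLM extraction pass) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """A typed, weighted edge that remembers which client records support it."""
    source: str
    kind: str
    target: str
    weight: float = 0.0
    evidence: set = field(default_factory=set)  # IDs of supporting records


class KnowledgeGraph:
    def __init__(self):
        # (source, kind, target) -> Relation
        self.relations = {}

    def add_triple(self, source, kind, target, record_id, weight=1.0):
        """Add one extracted triple, accumulating weight and provenance."""
        key = (source, kind, target)
        rel = self.relations.setdefault(key, Relation(source, kind, target))
        rel.weight += weight
        rel.evidence.add(record_id)
        return rel

    def neighbors(self, entity):
        """Outgoing relations of an entity, strongest first."""
        outgoing = [r for r in self.relations.values() if r.source == entity]
        return sorted(outgoing, key=lambda r: -r.weight)


# Hypothetical triples, as an LLM extraction pass over client records might emit.
extracted = [
    ("Acme Corp", "supplies", "Widget A", "rec-1"),
    ("Acme Corp", "supplies", "Widget A", "rec-2"),
    ("Acme Corp", "invoices", "Globex", "rec-3"),
]

graph = KnowledgeGraph()
for source, kind, target, record_id in extracted:
    graph.add_triple(source, kind, target, record_id)

top = graph.neighbors("Acme Corp")[0]
```

Unlike a co-occurrence count, each edge here carries a relation type and a pointer back to the records that justify it, so any downstream answer can be traced to the client's own data.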

The Proposal

To illustrate the value I can bring to Vendia, I propose developing a knowledge acquisition system that extracts semantic information from client data and constructs a knowledge graph for its categorization and analysis. This system, inspired by the GraphRAG approach, will be customized to meet the specific needs of Vendia and its clients. It will serve as the foundational context layer for any AI systems Vendia develops, offering a rich, deep, and contextually relevant understanding of each client's data.

One potential application of this system is a data categorization tool that automatically tags and organizes client data points to facilitate the creation and management of event-driven pipelines.
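As a rough illustration of such a tool, the sketch below tags records by the categories of the entities they mention and then groups records by tag so that each group could feed a pipeline. The entity-to-category mapping is assumed to come from the knowledge graph (for example, from its communities or hierarchy). The function names and sample data are hypothetical.

```python
from collections import defaultdict


def tag_records(record_entities, entity_category):
    """Tag each record with the categories of the entities it mentions."""
    tags = {}
    for record_id, entities in record_entities.items():
        categories = {entity_category[e] for e in entities if e in entity_category}
        tags[record_id] = sorted(categories)
    return tags


def group_by_tag(tags):
    """Invert record -> tags into tag -> records, e.g. one group per pipeline."""
    groups = defaultdict(list)
    for record_id, record_tags in sorted(tags.items()):
        for tag in record_tags:
            groups[tag].append(record_id)
    return dict(groups)


# Hypothetical inputs: categories derived from the graph, and entity mentions per record.
entity_category = {
    "Acme Corp": "supplier",
    "Globex": "customer",
    "Widget A": "product",
}
record_entities = {
    "rec-1": ["Acme Corp", "Widget A"],
    "rec-2": ["Globex"],
    "rec-3": ["Unknown Entity"],
}

tags = tag_records(record_entities, entity_category)
groups = group_by_tag(tags)
```

Records that mention no known entity, such as `rec-3` here, end up with an empty tag list and could be flagged for review instead of being routed.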

Furthermore, a trend in enterprise products is the integration of assistant interfaces like SAP's Joule and Microsoft's Copilot. A robust knowledge acquisition layer will provide the semantic foundation necessary for advanced agent behaviors such as natural language analytics, conversational task completion, and globally aware agent planning and decision making.

My research experience at the Language-Endowed Intelligent Agents Lab at Rensselaer Polytechnic Institute, where I was a doctoral candidate and research fellow, along with my background as a designer and engineer, uniquely qualifies me to lead this initiative at Vendia. I am passionate about the careful application of AI to complex business and human problems, and would love the opportunity to drive the future of data connectivity with you.