https://www.youtube.com/watch?v=2KVkpUGRtnk

This YouTube video provides a comprehensive tutorial on building a real-time knowledge graph from a collection of documents using Large Language Models (LLMs) and the CocoIndex data transformation framework, with Neo4j as the graph database.

Project Goal: Process a list of markdown documents, extract key concepts (entities) and the relationships between them using LLMs, and build a knowledge graph in Neo4j. Additionally, the project establishes relationships indicating which entities are mentioned in which documents.

Key Technologies Used:
- CocoIndex: A real-time data transformation framework for AI, built with a Rust core engine. It is an ETL (Extract, Transform, Load) framework designed to prepare fresh data for AI workloads, including building knowledge graphs and creating embeddings.
- Neo4j: A popular graph database used to store the knowledge graph.
- PostgreSQL: Used internally by CocoIndex for incremental processing, ensuring efficient updates.
- Large Language Models (LLMs): Specifically, OpenAI’s GPT-4o is used for summarization and relationship extraction. Ollama is mentioned as an alternative for on-premise LLM execution.
- CocoInsight: A data observability tool for CocoIndex, used to visualize and inspect the data flow and transformations.
Detailed Workflow Steps:
- Project Setup (04:20): Prerequisites: Install PostgreSQL (CocoIndex uses it for incremental processing) and Neo4j (the graph database), and configure an OpenAI API key (or set up Ollama for local LLMs). Docker Compose is recommended for easy setup of both databases. Project Initialization: Create a project folder (e.g., `graph`). Create `pyproject.toml` to define project metadata and dependencies (`cocoindex`, `python-dotenv`). Create `main.py` as the main script for the CocoIndex flow. Create a `.env` file to store the database URL (`COCOINDEX_DATABASE_URL`) and the OpenAI API key (`OPENAI_API_KEY`). Install dependencies using `pip3 install -e .`.
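As a sketch, the `.env` file described above might look like the following; the variable names come from the video, while the connection string and key are placeholder values:

```
# PostgreSQL connection used by CocoIndex for incremental processing
COCOINDEX_DATABASE_URL=postgresql://localhost:5432/cocoindex
# OpenAI API key used by the ExtractByLlm steps
OPENAI_API_KEY=sk-...
```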
- CocoIndex Flow Definition (`main.py`) (05:09): Flow Declaration: A CocoIndex data flow named `DocsToKG` is defined. Source Data Ingestion (05:19): The `LocalFile` source ingests markdown documents from a specified directory (`../docs/core`), with patterns that filter for `.md` and `.mdx` files. This step generates a KTable named `documents` with `filename` and `content` columns.
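CocoIndex's `LocalFile` source performs this ingestion internally; as a rough pure-Python illustration of what the resulting `documents` table contains (the helper below is hypothetical, not CocoIndex API):

```python
from pathlib import Path

def load_markdown_documents(root: str) -> list[dict]:
    """Collect (filename, content) rows, mimicking what the LocalFile
    source produces for the `documents` KTable when filtering for
    .md/.mdx patterns. Hypothetical helper, not CocoIndex API."""
    rows = []
    for pattern in ("*.md", "*.mdx"):
        for path in sorted(Path(root).rglob(pattern)):
            rows.append({
                "filename": str(path.relative_to(root)),
                "content": path.read_text(encoding="utf-8"),
            })
    return rows
```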
- LLM-based Document Summarization (07:33): Data Class Definition (07:54): A Python dataclass `DocumentSummary` is defined with `title: str` and `summary: str` fields; it acts as a schema for the LLM's output. Transformation: The `cocoindex.functions.ExtractByLlm` transformation is applied to the `content` of each document, using `gpt-4o` as the LLM model. `output_type` is set to `DocumentSummary` to guide the LLM's output format, and the `instruction` prompts the LLM to "Please summarize the content of the document." Collector (Document Nodes) (07:41): A collector named `document_node` gathers information about each document: the `filename` (as its primary key), the `title` from the LLM-generated summary, and the `summary` itself. This prepares the data for export as Neo4j nodes with the label `Document`.
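The dataclass acts purely as an output schema for the LLM; a minimal version matching the fields named in the video:

```python
from dataclasses import dataclass

@dataclass
class DocumentSummary:
    """Schema guiding the structured output of the ExtractByLlm step:
    a short title plus a prose summary for each document."""
    title: str
    summary: str
```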
- LLM-based Relationship Extraction (09:26): Data Class Definition (09:50): A Python dataclass `Relationship` is defined with `subject: str`, `predicate: str`, and `object: str`. A detailed docstring guides the LLM on what constitutes a relationship (e.g., nouns for subject/object, ignoring examples and code). Transformation: Another `cocoindex.functions.ExtractByLlm` transformation is applied to the `content` of each document. `output_type` is `list[Relationship]`, indicating that the LLM should output a list of triples, and the `instruction` prompts the LLM to "Please extract relationships from CocoIndex documents. Focus on concepts and ignore examples and code." Collector (Entity Relationships) (09:32): A collector named `entity_relationship` collects the extracted relationships, assigning a UUID as a unique ID for each one and collecting the `subject`, `object`, and `predicate` from the LLM's output.
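A minimal `Relationship` dataclass along these lines, with the docstring paraphrasing the guidance described in the video (the exact wording of the original docstring is not shown in the source):

```python
from dataclasses import dataclass

@dataclass
class Relationship:
    """Describe a relationship between two entities.

    Subject and object should be core concepts (nouns) from the
    documentation; ignore examples and code snippets. Note that the
    docstring itself serves as extraction guidance for the LLM.
    """
    subject: str
    predicate: str
    object: str
```

With `output_type=list[Relationship]`, each document yields a list of such triples, e.g. `Relationship(subject="CocoIndex", predicate="supports", object="incremental processing")`.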
- Document-Entity Mention Mapping (13:43): Collector (Entity Mentions): A collector named `entity_mention` creates relationships indicating which entities are mentioned in which documents. It collects the `subject` and `object` from the extracted relationships, along with the original document's `filename`, creating implicit links based on the presence of entities within a document.
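The mention mapping can be illustrated in plain Python (this mirrors the collector's logic, not the CocoIndex API): both the subject and the object of every extracted triple count as entities mentioned by the document.

```python
def entity_mentions(filename: str, relationships: list[dict]) -> list[dict]:
    """Derive (document -> entity) mention rows from extracted triples.
    Both the subject and the object of each relationship are treated as
    entities mentioned in the document. Illustrative only; the per-pair
    de-duplication here is an assumption, not confirmed by the video."""
    mentions = []
    seen = set()
    for rel in relationships:
        for entity in (rel["subject"], rel["object"]):
            if (filename, entity) not in seen:
                seen.add((filename, entity))
                mentions.append({"filename": filename, "entity": entity})
    return mentions
```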
- Export to Neo4j (17:31): Connection Specification: A `conn_spec` is defined for the Neo4j database connection, including URL, username, and password. Node Declaration (`Entity` Node) (19:22): Since entities are generated from relationships rather than directly from the initial documents, a `flow_builder.declare()` statement informs CocoIndex about the `Entity` node type for Neo4j, specifying `Entity` as the label and `value` as the primary key. Document Node Export (17:42): `document_node.export()` maps the `document_node` collector's data to Neo4j nodes; `label` is set to "Document" and `primary_key_fields` to "filename". All fields from the `document_node` collector (`filename`, `title`, `summary`) are carried over as properties of the Neo4j "Document" nodes. Relationship Export (19:33): `entity_relationship.export()` maps the collected relationships to Neo4j relationships; `rel_type` is set to "RELATIONSHIP". The `source` and `target` definitions use `NodeFromFields` to specify that the source and target of each relationship are "Entity" nodes, mapped by their "value" primary key. Fields from the `entity_relationship` collector (`id`, `subject`, `object`, `predicate`) are carried over as properties of the "RELATIONSHIP" links. Entity Mention Export (20:56): `entity_mention.export()` maps the document-entity mentions to Neo4j relationships; `rel_type` is set to "MENTION". The `source` is mapped to the "Document" node (using `filename`) and the `target` to the "Entity" node (using `value`). Fields from the `entity_mention` collector (`id`, `entity`, `filename`) are carried over as properties of the "MENTION" links.
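Putting the three exports together, the resulting graph shape can be sketched in plain Python. This is a simulation of the mapping described above, not the CocoIndex export API; CocoIndex performs the equivalent via the `export()` calls:

```python
def build_graph(doc_rows, rel_rows, mention_rows):
    """Simulate the Neo4j mapping: Document nodes keyed by filename,
    Entity nodes keyed by value, plus RELATIONSHIP and MENTION edges.
    Illustrative only."""
    nodes = {}   # (label, primary key) -> properties
    edges = []   # (rel_type, source key, target key)

    for d in doc_rows:  # Document nodes, primary key: filename
        nodes[("Document", d["filename"])] = d
    for r in rel_rows:  # Entity nodes arise implicitly from triples
        nodes.setdefault(("Entity", r["subject"]), {"value": r["subject"]})
        nodes.setdefault(("Entity", r["object"]), {"value": r["object"]})
        edges.append(("RELATIONSHIP", r["subject"], r["object"]))
    for m in mention_rows:  # Document -> Entity mention edges
        edges.append(("MENTION", m["filename"], m["entity"]))
    return nodes, edges
```

Because nodes are keyed by their primary key (`setdefault` above), an entity that appears in many triples still yields a single `Entity` node, which is the deduplication behaviour CocoIndex provides automatically.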
CocoIndex Features Highlighted:
- Real-time & Incremental Processing: CocoIndex's core strength, allowing continuous updates to the knowledge graph as the source data changes.
- ETL Framework: Simplifies data extraction, transformation (especially with AI assistance), and loading into target storages.
- LLM Integration: The `ExtractByLlm` function makes it easy to leverage LLMs for structured data extraction and summarization using data classes and prompts.
- Property Graph Targets: Native support for mapping data to graph elements (nodes, relationships, properties) in graph databases like Neo4j.
- Deduplication: Automatically handles deduplication of nodes based on defined primary keys, preventing redundant entries in the knowledge graph.
- Data Observability (CocoInsight): The `cocoindex server -ci` command launches a web interface to visually inspect the data flow, its transformations, and the intermediate states of the data.
Conclusion: The video demonstrates how CocoIndex simplifies the complex process of building a real-time knowledge graph from unstructured documents using LLMs and a graph database like Neo4j. It emphasizes CocoIndex's ability to handle data transformations, manage data integrity through primary keys, and provide observability into the data pipeline.
Related Concepts
- information extraction — Wikipedia
- knowledge graph — Wikipedia
- LLMs — Wikipedia
- ETL framework — Wikipedia
- Neo4j — Wikipedia