In the previous chapter, we saw how to search data. But how does the data get there?
The Python target includes a Semantic Data Pipeline that ingests complex hierarchical data (such as XML/RDF), flattens it, and embeds it so that it can be queried.
The pipeline consists of three stages, orchestrated by PtCrawler:
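1. Parse and flatten: lxml iterparse streams the XML source into flattened dictionaries.
2. Embed: the text of each object is passed to an ONNX embedder, which produces a vector.
3. Extract links: relationship attributes are turned into graph edges (for example, pt:parentTree).
graph LR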
A[XML Source] -->|lxml iterparse| B(Flattened Dict)
B -->|Text| C[ONNX Embedder]
B -->|Attributes| D[Link Extractor]
C -->|Vector| E[(SQLite: Embeddings)]
D -->|Source/Target| F[(SQLite: Links)]
B -->|JSON| G[(SQLite: Objects)]
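
The three stores on the right of the diagram can be pictured as three SQLite tables. The sketch below is only an illustration under stated assumptions: the table names, the column layout, the float32 BLOB encoding, and the helper names (`open_store`, `store_object`, `load_vector`) are hypothetical and are not taken from PtCrawler's actual schema.

```python
import json
import sqlite3
from array import array

# Hypothetical schema: names and columns are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS objects    (id TEXT PRIMARY KEY, data TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS embeddings (id TEXT PRIMARY KEY, vector BLOB NOT NULL);
CREATE TABLE IF NOT EXISTS links      (source TEXT NOT NULL, target TEXT NOT NULL);
"""

def open_store(path=":memory:"):
    """Open (or create) the database and make sure the three tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def store_object(conn, obj_id, flat, vector, targets):
    """Write one flattened object, its vector, and its outgoing links."""
    conn.execute("INSERT OR REPLACE INTO objects VALUES (?, ?)",
                 (obj_id, json.dumps(flat)))
    blob = array("f", vector).tobytes()  # pack the vector as float32 bytes
    conn.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
                 (obj_id, blob))
    conn.executemany("INSERT INTO links VALUES (?, ?)",
                     [(obj_id, target) for target in targets])
    conn.commit()

def load_vector(conn, obj_id):
    """Read a stored vector back as a list of floats, or None if absent."""
    row = conn.execute("SELECT vector FROM embeddings WHERE id = ?",
                       (obj_id,)).fetchone()
    if row is None:
        return None
    vec = array("f")
    vec.frombytes(row[0])
    return list(vec)
```

Keeping objects, vectors, and links in separate tables mirrors the diagram: each branch of the pipeline writes to its own store, and all three can be joined on the object identifier.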
UnifyWeaver uses a specific “Flattening” strategy optimized for RDF and Property Graphs.
Input XML (RDF):
<pt:PagePearl rdf:about="http://pearltrees.com/id/123">
    <dcterms:title>My Page</dcterms:title>
    <pt:parentTree rdf:resource="http://pearltrees.com/id/456" />
</pt:PagePearl>
Flattened Python Dict:
{
    "@tag": "PagePearl",
    "@about": "http://pearltrees.com/id/123",  # Attributes get '@' prefix
    "title": "My Page",                        # Child text becomes key
    # pt:parentTree is processed as a link, not just data
}
This structure allows the PtSearcher to return clean, JSON-serializable data objects without the complexity of the original DOM.
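
The flattening rule can be sketched in a few lines. This is a minimal illustration, not the PtCrawler implementation: it uses the standard library's `xml.etree.ElementTree.iterparse` in place of lxml so that it runs without extra dependencies, the helper names (`local_name`, `flatten`, `iter_flattened`) are hypothetical, and the `pt` namespace URI in the sample is a placeholder.

```python
import io
import xml.etree.ElementTree as ET

def local_name(qname):
    """Strip a Clark-notation namespace: '{uri}title' -> 'title'."""
    return qname.rsplit("}", 1)[-1]

def flatten(elem):
    """Flatten one element: tag and attributes get '@', child text becomes a key."""
    flat = {"@tag": local_name(elem.tag)}
    for name, value in elem.attrib.items():
        flat["@" + local_name(name)] = value
    for child in elem:
        text = (child.text or "").strip()
        if text:  # children without text (such as links) are skipped here
            flat[local_name(child.tag)] = text
    return flat

def iter_flattened(stream, tag):
    """Stream the document and yield one flat dict per element named `tag`."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == tag:
            yield flatten(elem)
            elem.clear()  # release memory held by the finished element

# The sample from above, wrapped in a root that declares the prefixes.
# The pt namespace URI is a placeholder (assumption).
SAMPLE = b"""<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pt="http://example.org/pt#">
  <pt:PagePearl rdf:about="http://pearltrees.com/id/123">
    <dcterms:title>My Page</dcterms:title>
    <pt:parentTree rdf:resource="http://pearltrees.com/id/456" />
  </pt:PagePearl>
</rdf:RDF>"""

records = list(iter_flattened(io.BytesIO(SAMPLE), "PagePearl"))
print(records)
```

Because the parser emits each element as soon as its closing tag is read, large files can be processed without building the whole tree in memory.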
One of the most powerful features of the PtCrawler is automatic link extraction. It scans the flattened data for standard relationship attributes:
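- rdf:resource
- resource
- {namespace}resource (Namespaced)
When found, it creates a directed edge in the links table:
- Source: the ID of the current object (@about or @id).
- Target: the URI given by the resource attribute.
This builds the graph that enables Graph RAG.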
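
Link extraction along these lines might look like the following sketch. It works on a parsed element rather than on PtCrawler's internal flattened form, which the text does not spell out. The names `RESOURCE_KEYS`, `ID_KEYS`, `first_attr`, and `extract_links` are hypothetical, and the `pt` namespace URI is a placeholder.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Attribute spellings to look for, including the namespaced (Clark notation) form.
RESOURCE_KEYS = ("rdf:resource", "resource", "{%s}resource" % RDF_NS)
ID_KEYS = ("{%s}about" % RDF_NS, "about", "{%s}ID" % RDF_NS, "id")

def first_attr(elem, names):
    """Return the value of the first attribute in `names` that is present."""
    for name in names:
        if name in elem.attrib:
            return elem.attrib[name]
    return None

def extract_links(elem):
    """Return (source, target) edges for one parsed element."""
    source = first_attr(elem, ID_KEYS)
    if source is None:
        return []  # without an identifier there is nothing to link from
    edges = []
    for child in elem:
        target = first_attr(child, RESOURCE_KEYS)
        if target is not None:
            edges.append((source, target))
    return edges

# The pt namespace URI below is a placeholder (assumption).
elem = ET.fromstring(
    '<pt:PagePearl xmlns:pt="http://example.org/pt#" '
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'rdf:about="http://pearltrees.com/id/123">'
    '<pt:parentTree rdf:resource="http://pearltrees.com/id/456" />'
    '</pt:PagePearl>'
)
edges = extract_links(elem)
print(edges)
```

Each returned pair corresponds to one row in the links table, with the current object as the source and the referenced URI as the target.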
To trigger this pipeline, use the crawler_run/2 predicate in your Prolog script:
index_my_data :-
    % Crawl 'data.rdf', expanding links up to depth 2
    crawler_run(['data.rdf'], 2).
The compiler generates Python code that initializes the PtCrawler and executes the pipeline efficiently using streams.
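
The text does not show the generated code, so the following is only one plausible reading of "expanding links up to depth 2": a breadth-first traversal that stops following links once the depth limit is reached. The function `crawl` and its `get_links` callback are hypothetical names, not part of PtCrawler.

```python
from collections import deque

def crawl(seeds, max_depth, get_links):
    """Visit seeds breadth-first, following links at most `max_depth` hops.

    `get_links(node)` is a caller-supplied function returning the outgoing
    targets of a node. Returns the visited (node, depth) pairs in order.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    visited = []
    while queue:
        node, depth = queue.popleft()
        visited.append((node, depth))
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this node's links
        for target in get_links(node):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

# Toy link graph standing in for the links table.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
visited = crawl(["a"], 2, lambda node: graph.get(node, []))
print(visited)
```

With a depth of 2, node "d" is never visited because it sits three hops from the seed.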