gleann-plugin-docs

Document extraction plugin for gleann. Converts PDF, DOCX, XLSX, PPTX and other binary document formats into graph-ready structured data that gleann ingests directly into KuzuDB + HNSW.

How It Works

The plugin acts as a document structure expert. Just as the AST code indexer understands code symbols and call relationships, gleann-plugin-docs understands document structure: titles, sections, subsections, and their hierarchy.

                Plugin                            gleann
   ┌──────────────────────────────┐    ┌───────────────────────────┐
   │  PDF/DOCX/XLSX/...           │    │                           │
   │       ↓                      │    │  nodes + edges            │
   │  MarkItDown / Docling        │    │       ↓                   │
   │       ↓                      │    │  KuzuDB (Document Graph)  │
   │  section_parser.py           │    │                           │
   │       ↓                      │    │  section content          │
   │  { nodes, edges }            │───→│       ↓                   │
   │                              │    │  MarkdownChunker          │
   └──────────────────────────────┘    │       ↓                   │
                                       │  HNSW (Vector Index)      │
                                       └───────────────────────────┘

Plugin response (POST /convert):

{
  "nodes": [
    {"_type": "Document", "path": "report.pdf", "title": "Q4 Report", "format": "pdf", ...},
    {"_type": "Section", "id": "doc:report.pdf:s0", "heading": "Introduction", "level": 1, "content": "...", ...},
    {"_type": "Section", "id": "doc:report.pdf:s0.0", "heading": "Background", "level": 2, "content": "...", ...}
  ],
  "edges": [
    {"_type": "HAS_SECTION", "from": "report.pdf", "to": "doc:report.pdf:s0"},
    {"_type": "HAS_SUBSECTION", "from": "doc:report.pdf:s0", "to": "doc:report.pdf:s0.0"}
  ]
}
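
To make the ID and edge scheme in the response above concrete, here is a minimal sketch of how converted Markdown could be mapped to Document/Section nodes. It is not the plugin's actual section_parser.py: the function name build_graph, the heading regex, and the decision to drop text before the first heading are all assumptions for illustration, and only a subset of the node fields is produced.

```python
import re

# ATX-style Markdown heading: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_graph(path, markdown):
    """Hypothetical helper: map Markdown headings to graph nodes and edges."""
    nodes = [{"_type": "Document", "path": path}]
    edges = []
    stack = []          # currently open sections as (level, id_suffix)
    child_counts = {}   # parent id_suffix (None = document) -> children so far
    current = None      # section that receives body text

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match is None:
            # Body text belongs to the most recent heading; text before the
            # first heading is dropped in this sketch.
            if current is not None:
                current["content"] += line + "\n"
            continue

        level = len(match.group(1))
        # Close every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()

        parent = stack[-1][1] if stack else None
        index = child_counts.get(parent, 0)
        child_counts[parent] = index + 1
        suffix = f"s{index}" if parent is None else f"{parent}.{index}"
        section_id = f"doc:{path}:{suffix}"

        current = {
            "_type": "Section",
            "id": section_id,
            "heading": match.group(2),
            "level": level,
            "content": "",
        }
        nodes.append(current)

        if parent is None:
            edges.append({"_type": "HAS_SECTION", "from": path, "to": section_id})
        else:
            edges.append({
                "_type": "HAS_SUBSECTION",
                "from": f"doc:{path}:{parent}",
                "to": section_id,
            })
        stack.append((level, suffix))

    return {"nodes": nodes, "edges": edges}
```

Calling build_graph("report.pdf", "# Introduction\n...\n## Background\n...") yields section IDs doc:report.pdf:s0 and doc:report.pdf:s0.0 joined by a HAS_SUBSECTION edge, matching the shape shown above. A real parser would also need to skip '#' lines inside fenced code blocks.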

Supported Formats

.pdf .docx .doc .xlsx .xls .pptx .ppt .png .jpg .jpeg .csv
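
For scripts that pre-filter files before handing them to gleann, the list above can be expressed as a simple extension check. This is a sketch only: the names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and case-insensitive matching is an assumption rather than documented plugin behaviour.

```python
from pathlib import Path

# Extensions copied from the Supported Formats list.
SUPPORTED_EXTENSIONS = frozenset({
    ".pdf", ".docx", ".doc", ".xlsx", ".xls",
    ".pptx", ".ppt", ".png", ".jpg", ".jpeg", ".csv",
})


def is_supported(path):
    """Return True if the file's final extension is in the supported list."""
    # Assumption: extensions are compared case-insensitively.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```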

Installation

Requirements: Python 3.10+

# 1. Clone and set up
git clone <this-repo> ~/.gleann/plugins/gleann-docs
cd ~/.gleann/plugins/gleann-docs
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 2. Register with gleann
python main.py --install

# Done! gleann will auto-start the plugin when needed.

Optional — high-quality PDF processing with Docling:

pip install -r requirements-docling.txt

Note: After --install, do not move this folder — gleann saves the absolute path to auto-start the plugin. If you relocate it, run python main.py --install again.

Usage

No manual server management needed. When gleann build encounters a PDF/DOCX/etc., it:

1. Checks if the plugin is running (:8765/health)
2. If not, auto-starts it using the registered command
3. Sends the file → receives graph-ready nodes/edges
4. Writes Document + Section nodes to KuzuDB (with --graph)
5. Chunks section content via MarkdownChunker → embeds to HNSW
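
The first two steps (health check, then auto-start) can be sketched from the client side as follows. This illustrates the protocol and is not gleann's own implementation: the function names, the 127.0.0.1 host, the timeouts, and the assumption that a healthy plugin answers /health with HTTP 200 are all guesses made for the example.

```python
import http.client
import subprocess
import time
import urllib.error
import urllib.request


def plugin_is_running(port=8765, host="127.0.0.1", timeout=1.0):
    """Return True if GET /health answers with HTTP 200 (assumed contract)."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or malformed reply.
        return False


def ensure_plugin(command, port=8765, wait_seconds=10.0):
    """Start the plugin with `command` unless it is already healthy.

    Returns the Popen handle when a new process was started, else None.
    """
    if plugin_is_running(port):
        return None
    process = subprocess.Popen(command)
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if plugin_is_running(port):
            return process
        time.sleep(0.2)
    process.terminate()
    raise RuntimeError(f"plugin did not become healthy on port {port}")
```

For example, ensure_plugin(["python", "main.py", "--serve", "--port", "8765"]) would reuse the manual server command from this README as the start command.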

# Index a directory with PDFs
gleann build myindex ./docs --graph

# The plugin starts automatically — no manual intervention

Manual server (for debugging):

python main.py --serve --port 8765

Backends

Format            Backend
.pdf              Docling (if installed) → fallback MarkItDown
Everything else   MarkItDown

Disabling Docling:

python main.py --install --no-docling   # permanent
DOCLING_ENABLED=false python main.py    # one-time
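
The selection rule described in this section can be summarised in a few lines. This is a sketch under stated assumptions, not the code in main.py: select_backend and docling_installed are hypothetical names, only the documented value DOCLING_ENABLED=false is treated as "off", and the permanent --no-docling setting and any runtime fallback after a Docling failure are not modelled.

```python
import importlib.util
import os
from pathlib import Path


def docling_installed():
    """Return True if the `docling` package can be imported."""
    return importlib.util.find_spec("docling") is not None


def select_backend(path, env=None, docling_available=None):
    """Pick "docling" or "markitdown" for a file (illustrative only)."""
    env = os.environ if env is None else env
    if docling_available is None:
        docling_available = docling_installed()

    # Assumption: Docling is on by default and only the value "false"
    # (any letter case) turns it off.
    enabled = env.get("DOCLING_ENABLED", "true").strip().lower() != "false"

    if Path(path).suffix.lower() == ".pdf" and enabled and docling_available:
        return "docling"
    return "markitdown"
```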

Performance:

MarkItDown: ~0.01s/page (fast, good enough for most documents)
Docling: ~3.1s/page on CPU (better tables, OCR, layout analysis; 4-8 GB RAM)
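
To get a feel for what these per-page figures mean for a whole document, a back-of-the-envelope estimate is enough. The helper below simply multiplies page count by the approximate rates quoted above; the name estimate_seconds is hypothetical, and real timings will vary with hardware and document content.

```python
# Approximate seconds per page, taken from the figures above (Docling on CPU).
SECONDS_PER_PAGE = {"markitdown": 0.01, "docling": 3.1}


def estimate_seconds(pages, backend):
    """Rough conversion time: pages multiplied by the quoted per-page rate."""
    return pages * SECONDS_PER_PAGE[backend]


if __name__ == "__main__":
    for backend in SECONDS_PER_PAGE:
        print(f"{backend}: ~{estimate_seconds(200, backend):.0f} s for 200 pages")
```

For a 200-page PDF this works out to roughly 2 seconds with MarkItDown versus about 620 seconds (a little over 10 minutes) with Docling on CPU.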
