Distributed Processing System

A generic distributed processing system that uses NATS JetStream for message routing and supports multiple service types (PDF processing, image analysis, text processing, etc.).

🏗️ Architecture Overview

Infrastructure Server    Processing Servers         Client Applications
     (NATS)              (GPU, CPU, etc.)           (Laptop, Web, etc.)
        |                       |                           |
   ┌────────────┐        ┌─────────────┐             ┌──────────────┐
   │    NATS    │◄──────►│ PDF Worker  │             │   Your App   │
   │ JetStream  │        │ Image Worker│◄───────────►│  (services.py│
   │  Message   │        │ Text Worker │             │  thinktank2) │
   │   Broker   │        │     ...     │             │      ...     │
   └────────────┘        └─────────────┘             └──────────────┘
        │                       │                           │
   Pure Messaging         Business Logic              Submit Requests

📁 Directory Structure

ct/
├── infrastructure/          # 🏗️ Infrastructure components
│   └── nats-server/        # Pure NATS server (dedicated server)
├── pdf/                    # 📄 PDF processing service (GPU server)
│   ├── docling_worker.py   # Worker process
│   ├── services.py         # Client library
│   └── tests/             # Service tests
└── future_services/        # 🔮 Add more services as needed
    ├── image_processing/
    ├── text_analysis/
    └── audio_transcription/

🚀 Quick Start

1. Infrastructure Setup (NATS Server)

On your dedicated NATS server:

cd infrastructure/nats-server/
./setup.sh
# Save the generated token - you'll need it for all services!

2. PDF Processing Service Setup (GPU Server)

On your GPU server:

cd pdf/
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp environment_config.txt .env
# Edit .env with NATS server IP and token

# Start docling worker (docs.process.*)
./start_worker.sh

# Optional: GLiNER KG infer (kg.infer) — same venv + extra deps
pip install -r requirements-gliner.txt
./start_kg_gliner.sh

GPU deploy (both workers): push to the main branch of climateandtech/pdf, then from coolify-provisioning/ run ./gpu-setup-production.sh (first time only) followed by ./gpu-deploy-worker.sh. See docs/GPU_PRODUCTION.md.

The platform calls kg.infer when KG_EXTRACT_ON_GPU=1 is set. No separate clone of the platform is needed on the GPU server.

3. Client Integration (Your Laptop)

Connect your existing services.py:

# In your thinktank2 project
import asyncio

from pdf.services import DocumentService


async def main():
    # Configure to point to your NATS server
    doc_service = DocumentService()
    await doc_service.setup()

    # Submit a PDF by its S3 key and await the result
    result = await doc_service.process_document(
        s3_key="documents/my-file.pdf",
        docling_options={...}
    )
    return result


asyncio.run(main())

🎛️ Configuration

Infrastructure Server (.env)

# Pure NATS configuration - no service specifics
NATS_TOKEN=your-generated-secure-token

Processing Services (.env)

# Points to your infrastructure
NATS_URL=nats://your-nats-server-ip:4222
NATS_TOKEN=your-generated-secure-token

# Service-specific settings
AWS_ACCESS_KEY_ID=your-s3-credentials
# ... etc
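The two `.env` files above share `NATS_URL` and `NATS_TOKEN`. As a minimal sketch of how a worker or client might validate them at startup, the snippet below reads both values and fails early if one is missing. `NatsSettings` and `load_nats_settings` are hypothetical names used for illustration; they are not taken from this repository's `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NatsSettings:
    """Connection settings shared by workers and clients (hypothetical helper)."""
    url: str
    token: str


def load_nats_settings(env: Optional[Mapping[str, str]] = None) -> NatsSettings:
    """Read NATS_URL and NATS_TOKEN, raising if either is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in ("NATS_URL", "NATS_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return NatsSettings(url=env["NATS_URL"], token=env["NATS_TOKEN"])


if __name__ == "__main__":
    # Example values only; in practice these come from the process environment.
    settings = load_nats_settings(
        {"NATS_URL": "nats://127.0.0.1:4222", "NATS_TOKEN": "example-token"}
    )
    print(settings.url)
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to test without touching the real environment.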

🔧 Service Types & Namespacing

Each service type gets its own namespace on the shared NATS server:

| Service Type | Stream Name | Subject Prefix | Worker Group |
|---|---|---|---|
| PDF Docling | `PDF_PROCESSING` | `pdf.docling.*` | `pdf_docling_workers` |
| Image Processing | `IMAGE_PROCESSING` | `image.process.*` | `image_workers` |
| Text Analysis | `TEXT_ANALYSIS` | `text.analyze.*` | `text_workers` |
| Audio Transcription | `AUDIO_TRANSCRIPTION` | `audio.transcribe.*` | `audio_workers` |
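The namespacing scheme in the table can be expressed as a small lookup structure. The sketch below is one possible shape, assuming a dataclass per service; the `ServiceNamespace` class, the dictionary keys, and the `request` action name are illustrative assumptions and may not match what `generic_config.py` actually contains.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServiceNamespace:
    stream: str          # JetStream stream name
    subject_prefix: str  # subject prefix without the trailing wildcard
    worker_group: str    # name shared by all workers of this service type

    def subject(self, action: str) -> str:
        """Build a concrete subject such as 'pdf.docling.request'."""
        return f"{self.subject_prefix}.{action}"

    @property
    def wildcard(self) -> str:
        """Subject pattern matching every action of this service."""
        return f"{self.subject_prefix}.*"


# Values copied from the table; the dictionary keys are hypothetical.
NAMESPACES: Dict[str, ServiceNamespace] = {
    "pdf_docling": ServiceNamespace("PDF_PROCESSING", "pdf.docling", "pdf_docling_workers"),
    "image_processing": ServiceNamespace("IMAGE_PROCESSING", "image.process", "image_workers"),
    "text_analysis": ServiceNamespace("TEXT_ANALYSIS", "text.analyze", "text_workers"),
    "audio_transcription": ServiceNamespace("AUDIO_TRANSCRIPTION", "audio.transcribe", "audio_workers"),
}


def check_unique(namespaces: Dict[str, ServiceNamespace]) -> None:
    """Raise ValueError if two services share a stream name or subject prefix."""
    seen_streams, seen_prefixes = set(), set()
    for key, ns in namespaces.items():
        if ns.stream in seen_streams or ns.subject_prefix in seen_prefixes:
            raise ValueError(f"namespace clash for {key!r}")
        seen_streams.add(ns.stream)
        seen_prefixes.add(ns.subject_prefix)


if __name__ == "__main__":
    check_unique(NAMESPACES)
    print(NAMESPACES["pdf_docling"].wildcard)
```

Keeping the prefix without its wildcard makes it straightforward to derive both the subscription pattern and individual subjects from a single value.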

📋 Deployment Scenarios

Scenario 1: Simple Setup

NATS Server: 1 dedicated server
PDF Processing: 1 GPU server
Clients: Your laptop

Scenario 2: Production Setup

NATS Cluster: 3 servers (HA)
PDF Workers: Multiple GPU servers (auto-scaling)
Image Workers: Multiple CPU servers
Clients: Web applications, mobile apps, etc.

Scenario 3: Development

NATS: Local Docker container
Workers: Local processes
Clients: Local development

🛡️ Security

Token Authentication: Secure token for NATS access
Network Isolation: Firewall rules for known IPs only
TLS: Optional TLS encryption for production
Separate Concerns: Infrastructure vs. business logic
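Token authentication and optional TLS can be combined when connecting. The following is a minimal sketch assuming the `nats-py` client; `build_connect_options` and `connect_secure` are hypothetical helpers, and the `ca_file` path is whatever certificate authority file your deployment uses.

```python
import ssl
from typing import Any, Dict, Optional


def build_connect_options(url: str, token: str, use_tls: bool = False,
                          ca_file: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for nats.connect() (hypothetical helper)."""
    options: Dict[str, Any] = {"servers": [url], "token": token}
    if use_tls:
        # The default context verifies the server certificate and hostname.
        # ca_file may point at a private CA; None uses the system trust store.
        options["tls"] = ssl.create_default_context(cafile=ca_file)
    return options


async def connect_secure(url: str, token: str, use_tls: bool = False,
                         ca_file: Optional[str] = None):
    """Open a NATS connection with token auth and optional TLS."""
    import nats  # nats-py, imported lazily so the helper above has no dependency

    return await nats.connect(**build_connect_options(url, token, use_tls, ca_file))
```

Firewall rules for known IPs are configured outside the application, so they are not shown here.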

🔄 Adding New Services

1. Create service directory: mkdir new_service/
2. Implement worker: Use existing patterns from pdf/
3. Configure namespace: Add to generic_config.py
4. Deploy: On appropriate servers (GPU, CPU, etc.)
5. Connect: All services use the same NATS infrastructure
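The steps above can be sketched as a skeleton worker. This is not the pattern used in `pdf/docling_worker.py`, which is not shown here; it is a generic JetStream pull-consumer loop written against `nats-py`. The stream name `NEW_SERVICE`, the subject `new_service.process.*`, the durable name `new_service_workers`, and the `handle_job` payload format are all assumptions, and the stream is assumed to exist already.

```python
import asyncio
import json


def handle_job(payload: bytes) -> bytes:
    """Placeholder business logic: acknowledge the job ID found in the request."""
    request = json.loads(payload)
    return json.dumps({"job_id": request.get("job_id"), "status": "done"}).encode()


async def run_worker(url: str, token: str,
                     stream: str = "NEW_SERVICE",
                     subject: str = "new_service.process.*",
                     durable: str = "new_service_workers") -> None:
    """Pull jobs from JetStream one at a time and process them."""
    import nats  # nats-py, imported lazily
    from nats.errors import TimeoutError as NatsTimeoutError

    nc = await nats.connect(url, token=token)
    js = nc.jetstream()
    # Workers that use the same durable name share the consumer, so each
    # message is delivered to only one of them.
    sub = await js.pull_subscribe(subject, durable=durable, stream=stream)
    try:
        while True:
            try:
                msgs = await sub.fetch(1, timeout=5)
            except NatsTimeoutError:
                continue  # no work available; poll again
            for msg in msgs:
                try:
                    handle_job(msg.data)  # result delivery is omitted in this sketch
                    await msg.ack()
                except Exception:
                    await msg.nak()  # ask JetStream to redeliver the job
    finally:
        await nc.close()


if __name__ == "__main__":
    asyncio.run(run_worker("nats://127.0.0.1:4222", "example-token"))
```

Acknowledging only after `handle_job` returns means a crashed worker leaves the message unacknowledged, so JetStream can redeliver it to another worker in the group.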

📖 Documentation

Infrastructure Setup - NATS server deployment
Architecture Guide - Detailed system design
PDF Service - PDF processing specifics

🧪 Testing

# Test infrastructure
cd infrastructure/nats-server/
# Connection tests included in setup

# Test PDF service
cd pdf/
pytest tests/ -v

# Test end-to-end
python -c "
import asyncio
from services import DocumentService

async def test():
    service = DocumentService()
    await service.setup()
    print('✅ Connected to distributed system!')

asyncio.run(test())
"

🎯 Key Benefits

✅ Scalable: Add processing power by adding servers
✅ Flexible: Mix different service types on same infrastructure
✅ Reliable: Dedicated message infrastructure
✅ Maintainable: Clear separation of concerns
✅ Future-proof: Easy to add new processing capabilities

Perfect for: Multi-modal AI processing, distributed computing, microservices architecture
