climateandtech/pdf


Distributed Processing System

A generic distributed processing system using NATS JetStream for message routing and supporting multiple service types (PDF processing, image analysis, text processing, etc.).

๐Ÿ—๏ธ Architecture Overview

Infrastructure Server    Processing Servers         Client Applications
     (NATS)              (GPU, CPU, etc.)           (Laptop, Web, etc.)
        │                       │                           │
   ┌────────────┐        ┌─────────────┐             ┌──────────────┐
   │    NATS    │◄──────►│ PDF Worker  │             │   Your App   │
   │ JetStream  │        │ Image Worker│◄───────────►│ (services.py │
   │  Message   │        │ Text Worker │             │  thinktank2) │
   │   Broker   │        │     ...     │             │      ...     │
   └────────────┘        └─────────────┘             └──────────────┘
        │                       │                           │
   Pure Messaging         Business Logic              Submit Requests

๐Ÿ“ Directory Structure

ct/
├── infrastructure/         # 🏗️ Infrastructure components
│   └── nats-server/        # Pure NATS server (dedicated server)
├── pdf/                    # 📄 PDF processing service (GPU server)
│   ├── docling_worker.py   # Worker process
│   ├── services.py         # Client library
│   └── tests/              # Service tests
└── future_services/        # 🔮 Add more services as needed
    ├── image_processing/
    ├── text_analysis/
    └── audio_transcription/

🚀 Quick Start

1. Infrastructure Setup (NATS Server)

On your dedicated NATS server:

cd infrastructure/nats-server/
./setup.sh
# Save the generated token - you'll need it for all services!

2. PDF Processing Service Setup (GPU Server)

On your GPU server:

cd pdf/
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp environment_config.txt .env
# Edit .env with NATS server IP and token

# Start docling worker (docs.process.*)
./start_worker.sh

# Optional: GLiNER KG infer (kg.infer); uses the same venv plus extra deps
pip install -r requirements-gliner.txt
./start_kg_gliner.sh

GPU deployment (both workers): push to the main branch of climateandtech/pdf, then, from coolify-provisioning/, run ./gpu-setup-production.sh (first time only) followed by ./gpu-deploy-worker.sh. See docs/GPU_PRODUCTION.md.

The platform calls kg.infer when KG_EXTRACT_ON_GPU=1 is set; no separate clone of the platform is needed on the GPU server.

3. Client Integration (Your Laptop)

Connect your existing services.py:

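# In your thinktank2 project
import asyncio

from pdf.services import DocumentService


async def main():
    # Configure to point to your NATS server
    doc_service = DocumentService()
    await doc_service.setup()

    # Submit a document stored in S3 and wait for the processing result
    result = await doc_service.process_document(
        s3_key="documents/my-file.pdf",
        docling_options={...}
    )
    print(result)


asyncio.run(main())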
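Because a request travels through the broker to a remote worker, a client may want an upper bound on how long it waits. The following is a minimal sketch of that idea, not part of the repository: `process_with_timeout` and `_StubService` are hypothetical names, and the stub only stands in for `DocumentService` (whose real return value is not documented here) so the example runs without a NATS server.

```python
import asyncio


async def process_with_timeout(service, s3_key: str, timeout_s: float, **options):
    """Await service.process_document, but give up after timeout_s seconds."""
    return await asyncio.wait_for(
        service.process_document(s3_key=s3_key, **options),
        timeout=timeout_s,
    )


class _StubService:
    """Hypothetical stand-in for DocumentService; it only sleeps, then returns."""

    def __init__(self, delay_s: float):
        self.delay_s = delay_s

    async def process_document(self, s3_key: str, **options):
        await asyncio.sleep(self.delay_s)
        return {"s3_key": s3_key, "status": "done"}  # invented result shape


async def demo():
    # A fast "worker" answers well within the limit.
    fast = await process_with_timeout(_StubService(0.01), "documents/my-file.pdf", 1.0)
    # A slow "worker" exceeds the limit and the call is cancelled.
    try:
        await process_with_timeout(_StubService(1.0), "documents/my-file.pdf", 0.05)
        slow = "finished"
    except asyncio.TimeoutError:
        slow = "timed out"
    return fast, slow
```

With the real client, the same wrapper would be called as `await process_with_timeout(doc_service, "documents/my-file.pdf", 600)`, assuming `process_document` accepts `s3_key` as a keyword argument as in the snippet above.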

๐ŸŽ›๏ธ Configuration

Infrastructure Server (.env)

# Pure NATS configuration - no service specifics
NATS_TOKEN=your-generated-secure-token

Processing Services (.env)

# Points to your infrastructure
NATS_URL=nats://your-nats-server-ip:4222
NATS_TOKEN=your-generated-secure-token

# Service-specific settings
AWS_ACCESS_KEY_ID=your-s3-credentials
# ... etc
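As an illustration of how a service might consume these settings, here is a small sketch that reads `NATS_URL` and `NATS_TOKEN` and fails early when either is missing. `NatsSettings`, `load_nats_settings`, and `connect` are hypothetical helpers, and the use of the `nats-py` client is an assumption; the repository's own loading code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NatsSettings:
    url: str
    token: str


def load_nats_settings(env: Optional[Mapping[str, str]] = None) -> NatsSettings:
    """Read NATS_URL and NATS_TOKEN, raising if either is missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in ("NATS_URL", "NATS_TOKEN") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return NatsSettings(url=env["NATS_URL"], token=env["NATS_TOKEN"])


async def connect(settings: NatsSettings):
    """Open a token-authenticated connection (assumes the nats-py package)."""
    import nats  # imported lazily so the loader above needs only the stdlib

    return await nats.connect(settings.url, token=settings.token)
```

Validating the settings before connecting turns a missing token into a clear startup error rather than an authentication failure at the broker.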

🔧 Service Types & Namespacing

Each service type gets its own namespace on the shared NATS server:

Service Type          Stream Name           Subject Prefix        Worker Group
PDF Docling           PDF_PROCESSING        pdf.docling.*         pdf_docling_workers
Image Processing      IMAGE_PROCESSING      image.process.*       image_workers
Text Analysis         TEXT_ANALYSIS         text.analyze.*        text_workers
Audio Transcription   AUDIO_TRANSCRIPTION   audio.transcribe.*    audio_workers
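The namespaces above can be pictured as a small registry. The sketch below is illustrative only: the stream names, prefixes, and worker groups are copied from the table, while `ServiceNamespace`, `NAMESPACES`, the dictionary keys, `subject_for`, and `subject_matches` are hypothetical and not taken from generic_config.py. The matcher follows standard NATS wildcard rules, where `*` matches exactly one token and `>` matches one or more trailing tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceNamespace:
    stream: str
    subject_prefix: str  # the table's prefix without the trailing ".*"
    worker_group: str


# Values from the table above; the keys are illustrative.
NAMESPACES = {
    "pdf_docling": ServiceNamespace("PDF_PROCESSING", "pdf.docling", "pdf_docling_workers"),
    "image": ServiceNamespace("IMAGE_PROCESSING", "image.process", "image_workers"),
    "text": ServiceNamespace("TEXT_ANALYSIS", "text.analyze", "text_workers"),
    "audio": ServiceNamespace("AUDIO_TRANSCRIPTION", "audio.transcribe", "audio_workers"),
}


def subject_for(service: str, token: str) -> str:
    """Build a concrete subject such as 'pdf.docling.job42'."""
    if not token or any(ch in token for ch in ".*> \t"):
        raise ValueError(f"invalid subject token: {token!r}")
    return f"{NAMESPACES[service].subject_prefix}.{token}"


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

Keeping each service under its own prefix means a worker subscribed to `pdf.docling.*` never receives messages published under `image.process.*`, even though both share one NATS server.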

📋 Deployment Scenarios

Scenario 1: Simple Setup

  • NATS Server: 1 dedicated server
  • PDF Processing: 1 GPU server
  • Clients: Your laptop

Scenario 2: Production Setup

  • NATS Cluster: 3 servers (HA)
  • PDF Workers: Multiple GPU servers (auto-scaling)
  • Image Workers: Multiple CPU servers
  • Clients: Web applications, mobile apps, etc.

Scenario 3: Development

  • NATS: Local Docker container
  • Workers: Local processes
  • Clients: Local development

๐Ÿ›ก๏ธ Security

  • Token Authentication: Secure token for NATS access
  • Network Isolation: Firewall rules for known IPs only
  • TLS: Optional TLS encryption for production
  • Separate Concerns: Infrastructure vs. business logic
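To make the token and TLS points concrete, the sketch below assembles connection options for the `nats-py` client, which is assumed here rather than confirmed by this README. `build_connect_options` is a hypothetical helper, and `ca_file` is an assumed optional path to a private CA bundle.

```python
import ssl
from typing import Any, Dict, Optional


def build_connect_options(
    url: str,
    token: str,
    use_tls: bool = False,
    ca_file: Optional[str] = None,
) -> Dict[str, Any]:
    """Return keyword arguments suitable for nats.connect(**options)."""
    options: Dict[str, Any] = {"servers": [url], "token": token}
    if use_tls:
        # The default context verifies the server certificate and hostname;
        # ca_file may point at a private CA bundle (None uses system CAs).
        options["tls"] = ssl.create_default_context(cafile=ca_file)
    return options
```

A client or worker would then call `await nats.connect(**build_connect_options(url, token, use_tls=True))`. Leaving `use_tls` off matches the README's description of TLS as optional.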

🔄 Adding New Services

  1. Create service directory: mkdir new_service/
  2. Implement worker: Use existing patterns from pdf/
  3. Configure namespace: Add to generic_config.py
  4. Deploy: On appropriate servers (GPU, CPU, etc.)
  5. Connect: All services use the same NATS infrastructure
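The "implement worker" step above can be sketched as a JetStream pull consumer. This is a rough outline under stated assumptions, not the pattern used in pdf/: it assumes the `nats-py` client, JSON request payloads, and a stream that already exists; `handle_payload` and `run_worker` are hypothetical names.

```python
import json


def handle_payload(data: bytes) -> dict:
    """Placeholder business logic: decode a JSON request and echo it back."""
    request = json.loads(data)
    return {"status": "ok", "echo": request}


async def run_worker(url: str, token: str, stream: str, subject: str, group: str) -> None:
    """Pull messages for one service namespace and process them one at a time."""
    import nats  # lazy import: requires the nats-py package
    from nats.errors import TimeoutError as NatsTimeoutError

    nc = await nats.connect(url, token=token)
    js = nc.jetstream()
    # Assumes the stream already exists. Workers that share one durable
    # name divide the stream's messages between them.
    sub = await js.pull_subscribe(subject, durable=group, stream=stream)
    try:
        while True:
            try:
                msgs = await sub.fetch(batch=1, timeout=5)
            except NatsTimeoutError:
                continue  # nothing queued yet; poll again
            for msg in msgs:
                try:
                    result = handle_payload(msg.data)
                    print(f"processed {msg.subject}: {result['status']}")
                    await msg.ack()
                except Exception:
                    await msg.nak()  # ask JetStream to redeliver the message
    finally:
        await nc.close()
```

A new service would replace `handle_payload` with its own logic and start the loop with, for example, `asyncio.run(run_worker(url, token, "IMAGE_PROCESSING", "image.process.*", "image_workers"))`, using the names from its namespace entry.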

📖 Documentation

🧪 Testing

# Test infrastructure
cd infrastructure/nats-server/
# Connection tests included in setup

# Test PDF service
cd pdf/
pytest tests/ -v

# Test end-to-end
python -c "
import asyncio
from services import DocumentService

async def test():
    service = DocumentService()
    await service.setup()
    print('✅ Connected to distributed system!')

asyncio.run(test())
"

🎯 Key Benefits

✅ Scalable: Add processing power by adding servers
✅ Flexible: Mix different service types on same infrastructure
✅ Reliable: Dedicated message infrastructure
✅ Maintainable: Clear separation of concerns
✅ Future-proof: Easy to add new processing capabilities

Perfect for: Multi-modal AI processing, distributed computing, microservices architecture
