Benchify/rlm_benchmark

RLM Benchmark: Comparing Coding Agent Strategies

This repository contains experiments comparing three coding agent architectures on Terminal-Bench:

  1. Simple Agent - Direct problem-solving without decomposition
  2. Handrolled RLM - Forced 3-step decomposition (decompose → solve → synthesize)
  3. Canonical RLM - Adaptive decomposition using the RLM framework

Goal

Test whether recursive language models (RLMs) with formal planning improve coding agent performance compared to direct problem-solving approaches.

Setup

# Install dependencies (requires Python 3.12+)
uv sync

# Set up API key
echo "XAI_API_KEY=your_key_here" > .env

# Ensure Docker is running (required for Harbor benchmarks)
docker ps

Running Benchmarks

Run agents on Terminal-Bench-Sample (10 tasks):

# Simple agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_simple_agent:HarborSimpleAgent

# Handrolled RLM agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_rlm_agent:HarborRLMAgent

# Canonical RLM agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_canonical_rlm:HarborCanonicalRLM

Agent Implementations

Simple Agent (harbor_simple_agent.py)

  • Direct problem-solving with tool calling
  • Tools: read_file, write_file, exec_bash
  • No decomposition overhead
  • Natural iteration loop: write → test → fix
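The write → test → fix loop above can be sketched as a plain tool-dispatch loop. This is an illustrative stub, not the repository's implementation: the tool names (`read_file`, `write_file`, `exec_bash`) come from the README, but `run_agent` and `model_step` are hypothetical names, and the model is stubbed out.

```python
# Minimal sketch of a direct tool-calling agent loop (illustrative only).
# Tool names match the README; run_agent/model_step are hypothetical.
import subprocess
from pathlib import Path

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def read_file(path: str) -> str:
    return Path(path).read_text()

def exec_bash(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "exec_bash": exec_bash}

def run_agent(model_step, max_turns: int = 10) -> str:
    """Repeatedly ask the model for a tool call and feed the result back."""
    observation = None
    for _ in range(max_turns):
        action = model_step(observation)   # model chooses the next tool call
        if action["tool"] == "done":
            return action["args"]["summary"]
        observation = TOOLS[action["tool"]](**action["args"])
    return "max turns reached"
```

With a scripted `model_step`, the loop writes a file, sees the tool result, and finishes, which is the whole control flow of the simple agent: no decomposition, just iterate until done.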

Handrolled RLM (harbor_rlm_agent.py)

  • Forced decomposition: always breaks task into subtasks
  • Fixed pattern: decompose → solve subtasks → synthesize
  • No model agency in choosing strategy
  • Uses recursive sub-agents
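The fixed decompose → solve → synthesize pipeline can be sketched as three chained model calls. A minimal sketch, assuming a `query_model(prompt) -> str` callable; the function name and prompt wording are illustrative, not the repository's actual prompts.

```python
# Sketch of the handrolled RLM's fixed three-step pipeline (illustrative).
# query_model stands in for the real LLM call.
def handrolled_rlm(task: str, query_model) -> str:
    # Step 1: always decompose, whether or not the task needs it.
    subtasks = query_model(f"Break this task into subtasks:\n{task}").splitlines()
    # Step 2: solve each subtask with a separate sub-agent call.
    solutions = [query_model(f"Solve this subtask:\n{s}") for s in subtasks if s.strip()]
    # Step 3: synthesize the sub-solutions into one final answer.
    return query_model("Combine these solutions:\n" + "\n".join(solutions))
```

Note the pipeline runs unconditionally: even a one-line task pays for a decomposition call and a synthesis call, which is the overhead the simple agent avoids.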

Canonical RLM (harbor_canonical_rlm.py)

  • Uses official RLM implementation
  • Model writes Python code in REPL environment
  • Adaptive: model chooses when/how to decompose
  • Access to: bash(), llm_query(), llm_query_batched()
  • Integrated with Harbor via nest_asyncio bridge
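Inside the REPL, the model might write code like the following to decompose on its own terms. The function names `bash` and `llm_query_batched` come from the README; the stub bodies below exist only so the sketch runs standalone, and the real versions execute in Harbor's container and fan out to sub-model calls.

```python
# Illustrative snippet of model-written REPL code in the canonical RLM.
# bash/llm_query_batched are provided by the RLM environment; these
# stubs are placeholders so the example is self-contained.
def bash(command):
    return "file_a.py\nfile_b.py"   # stub: real version runs in the container

def llm_query_batched(prompts):
    return [f"summary of: {p[:30]}" for p in prompts]   # stub: real version calls sub-LLMs

# The model decides decomposition is worth it here: summarize each file
# via batched sub-queries, then continue from the combined summaries.
files = bash("ls *.py").splitlines()
summaries = llm_query_batched([f"Summarize {name}" for name in files])
context = "\n".join(summaries)
```

The key contrast with the handrolled variant is that nothing forces this structure: the model could just as well call `bash` once and answer directly.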

Key Technical Details

Harbor + Canonical RLM Integration

The canonical RLM executes synchronous Python code in a REPL, while Harbor's environment is async and Docker-based. We bridge the two with nest_asyncio:

# In REPL setup_code:
import asyncio

import nest_asyncio

nest_asyncio.apply()  # allow run_until_complete inside Harbor's already-running loop

def bash(command):
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(harbor_env.exec(command))
    return result

This allows the model's sync code to call async Harbor functions seamlessly.

Results

See RESULTS_SUMMARY.md for detailed analysis.

Preview:

  • Simple Agent: 14% success rate (1/7 tasks)
  • Handrolled RLM: 0% success rate (0/8 tasks)
  • Canonical RLM: Running...

Files

Core implementations:

  • harbor_simple_agent.py - Simple direct agent
  • harbor_rlm_agent.py - Forced decomposition agent
  • harbor_canonical_rlm.py - Adaptive RLM agent

Standalone experiments:

  • simple_agent.py - Simple agent (no Harbor)
  • rlm_agent.py - RLM agent (no Harbor)
  • compare_agents.py - Synthetic task comparison

Documentation:

  • RESULTS_SUMMARY.md - Comprehensive results and analysis
  • pyproject.toml - Dependencies

Dependencies

  • Harbor - Agent benchmarking framework
  • RLM - Recursive Language Model implementation
  • nest_asyncio - Nested event loop support
  • X.AI Grok API (or other LiteLLM-compatible provider)

License

MIT

About

Are RLMs good at coding? We decided to investigate. (WIP)
