Benchify/rlm_benchmark

RLM Benchmark: Comparing Coding Agent Strategies

This repository contains experiments comparing three coding agent architectures on Terminal-Bench:

  1. Simple Agent - Direct problem-solving without decomposition
  2. Handrolled RLM - Forced 3-step decomposition (decompose → solve → synthesize)
  3. Canonical RLM - Adaptive decomposition using the RLM framework

Goal

Test whether recursive language models (RLMs) with formal planning improve coding agent performance compared to direct problem-solving approaches.

Setup

# Install dependencies (requires Python 3.12+)
uv sync

# Set up API key
echo "XAI_API_KEY=your_key_here" > .env

# Ensure Docker is running (required for Harbor benchmarks)
docker ps

Running Benchmarks

Run agents on Terminal-Bench-Sample (10 tasks):

# Simple agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_simple_agent:HarborSimpleAgent

# Handrolled RLM agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_rlm_agent:HarborRLMAgent

# Canonical RLM agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_canonical_rlm:HarborCanonicalRLM

Agent Implementations

Simple Agent (harbor_simple_agent.py)

  • Direct problem-solving with tool calling
  • Tools: read_file, write_file, exec_bash
  • No decomposition overhead
  • Natural iteration loop: write → test → fix
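The write → test → fix loop above can be sketched as a plain tool-dispatch loop. This is an illustrative stub, not the repository's implementation: the tool names (`read_file`, `write_file`, `exec_bash`) come from the README, but `run_agent` and `model_step` are hypothetical names, and the model is stubbed out.

```python
# Minimal sketch of a direct tool-calling agent loop (illustrative only).
# Tool names match the README; run_agent/model_step are hypothetical.
import subprocess
from pathlib import Path

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def read_file(path: str) -> str:
    return Path(path).read_text()

def exec_bash(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "exec_bash": exec_bash}

def run_agent(model_step, max_turns: int = 10) -> str:
    """Repeatedly ask the model for a tool call and feed the result back."""
    observation = None
    for _ in range(max_turns):
        action = model_step(observation)   # model chooses the next tool call
        if action["tool"] == "done":
            return action["args"]["summary"]
        observation = TOOLS[action["tool"]](**action["args"])
    return "max turns reached"
```

With a scripted `model_step`, the loop writes a file, sees the tool result, and finishes, which is the whole control flow of the simple agent: no decomposition, just iterate until done.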

Handrolled RLM (harbor_rlm_agent.py)

  • Forced decomposition: always breaks task into subtasks
  • Fixed pattern: decompose → solve subtasks → synthesize
  • No model agency in choosing strategy
  • Uses recursive sub-agents
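The fixed decompose → solve → synthesize pipeline can be sketched as three chained model calls. A minimal sketch, assuming a `query_model(prompt) -> str` callable; the function name and prompt wording are illustrative, not the repository's actual prompts.

```python
# Sketch of the handrolled RLM's fixed three-step pipeline (illustrative).
# query_model stands in for the real LLM call.
def handrolled_rlm(task: str, query_model) -> str:
    # Step 1: always decompose, whether or not the task needs it.
    subtasks = query_model(f"Break this task into subtasks:\n{task}").splitlines()
    # Step 2: solve each subtask with a separate sub-agent call.
    solutions = [query_model(f"Solve this subtask:\n{s}") for s in subtasks if s.strip()]
    # Step 3: synthesize the sub-solutions into one final answer.
    return query_model("Combine these solutions:\n" + "\n".join(solutions))
```

Note the pipeline runs unconditionally: even a one-line task pays for a decomposition call and a synthesis call, which is the overhead the simple agent avoids.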

Canonical RLM (harbor_canonical_rlm.py)

  • Uses official RLM implementation
  • Model writes Python code in REPL environment
  • Adaptive: model chooses when/how to decompose
  • Access to: bash(), llm_query(), llm_query_batched()
  • Integrated with Harbor via nest_asyncio bridge
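Inside the REPL, the model might write code like the following to decompose on its own terms. The function names `bash` and `llm_query_batched` come from the README; the stub bodies below exist only so the sketch runs standalone, and the real versions execute in Harbor's container and fan out to sub-model calls.

```python
# Illustrative snippet of model-written REPL code in the canonical RLM.
# bash/llm_query_batched are provided by the RLM environment; these
# stubs are placeholders so the example is self-contained.
def bash(command):
    return "file_a.py\nfile_b.py"   # stub: real version runs in the container

def llm_query_batched(prompts):
    return [f"summary of: {p[:30]}" for p in prompts]   # stub: real version calls sub-LLMs

# The model decides decomposition is worth it here: summarize each file
# via batched sub-queries, then continue from the combined summaries.
files = bash("ls *.py").splitlines()
summaries = llm_query_batched([f"Summarize {name}" for name in files])
context = "\n".join(summaries)
```

The key contrast with the handrolled variant is that nothing forces this structure: the model could just as well call `bash` once and answer directly.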

Key Technical Details

Harbor + Canonical RLM Integration

The canonical RLM executes synchronous Python code in a REPL, while Harbor's environment is async and Docker-based. We bridge the two with nest_asyncio:

# In REPL setup_code:
import asyncio

import nest_asyncio

nest_asyncio.apply()  # allow run_until_complete inside Harbor's already-running loop

def bash(command):
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(harbor_env.exec(command))
    return result

This allows the model's sync code to call async Harbor functions seamlessly.

Results

See RESULTS_SUMMARY.md for detailed analysis.

Preview:

  • Simple Agent: 14% success rate (1/7 tasks)
  • Handrolled RLM: 0% success rate (0/8 tasks)
  • Canonical RLM: Running...

Files

Core implementations:

  • harbor_simple_agent.py - Simple direct agent
  • harbor_rlm_agent.py - Forced decomposition agent
  • harbor_canonical_rlm.py - Adaptive RLM agent

Standalone experiments:

  • simple_agent.py - Simple agent (no Harbor)
  • rlm_agent.py - RLM agent (no Harbor)
  • compare_agents.py - Synthetic task comparison

Documentation:

  • RESULTS_SUMMARY.md - Comprehensive results and analysis
  • pyproject.toml - Dependencies

Dependencies

  • Harbor - Agent benchmarking framework
  • RLM - Recursive Language Model implementation
  • nest_asyncio - Nested event loop support
  • X.AI Grok API (or other LiteLLM-compatible provider)

License

MIT

About

Are RLMs good at coding? We decided to investigate. (WIP)
