This repository contains experiments comparing three coding agent architectures on Terminal-Bench:
- Simple Agent - Direct problem-solving without decomposition
- Handrolled RLM - Forced 3-step decomposition (decompose → solve → synthesize)
- Canonical RLM - Adaptive decomposition using the RLM framework
The goal is to test whether recursive language models (RLMs) with formal planning improve coding-agent performance compared to direct problem-solving approaches.
```bash
# Install dependencies (requires Python 3.12+)
uv sync

# Set up API key
echo "XAI_API_KEY=your_key_here" > .env

# Ensure Docker is running (required for Harbor benchmarks)
docker ps
```

Run the agents on Terminal-Bench-Sample (10 tasks):
```bash
# Simple agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_simple_agent:HarborSimpleAgent

# Handrolled RLM agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_rlm_agent:HarborRLMAgent

# Canonical RLM agent
uv run harbor run -d terminal-bench-sample@2.0 --agent-import-path harbor_canonical_rlm:HarborCanonicalRLM
```

Simple Agent:
- Direct problem-solving with tool calling
- Tools: read_file, write_file, exec_bash
- No decomposition overhead
- Natural iteration loop: write → test → fix
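The direct tool-calling loop above can be sketched as follows. This is a minimal illustration, not the repo's actual implementation: the tool names match the list above, but the implementations and the `dispatch` helper are hypothetical stand-ins.

```python
import subprocess

# Hypothetical tool implementations matching the tool names above.
def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

def exec_bash(command):
    # Run a shell command and return combined stdout/stderr.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "exec_bash": exec_bash}

def dispatch(tool_name, **kwargs):
    # Route a model-requested tool call to its implementation.
    return TOOLS[tool_name](**kwargs)
```

The agent loop then just feeds each tool result back to the model until the task is solved, which is what enables the natural write → test → fix iteration.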
Handrolled RLM:
- Forced decomposition: always breaks the task into subtasks
- Fixed pattern: decompose → solve subtasks → synthesize
- No model agency in choosing strategy
- Uses recursive sub-agents
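The fixed decompose → solve → synthesize pattern can be sketched as below. This is a hedged sketch, not the repo's code: `llm` is a hypothetical stand-in for the model call, and the prompt strings are illustrative.

```python
def solve_with_forced_decomposition(task, llm):
    # Step 1: always decompose, regardless of task complexity.
    subtasks = llm(f"Break this task into subtasks, one per line:\n{task}").splitlines()
    # Step 2: solve each subtask independently (a recursive sub-agent call).
    solutions = [llm(f"Solve this subtask:\n{s}") for s in subtasks if s.strip()]
    # Step 3: synthesize the sub-solutions into a final answer.
    return llm("Combine these partial solutions:\n" + "\n".join(solutions))
```

Note that the model never gets to skip step 1, which is exactly the "no agency in choosing strategy" property listed above.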
Canonical RLM:
- Uses the official RLM implementation
- Model writes Python code in REPL environment
- Adaptive: model chooses when/how to decompose
- Access to:
  `bash()`, `llm_query()`, `llm_query_batched()`
- Integrated with Harbor via a `nest_asyncio` bridge
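A sketch of what model-written REPL code might look like against those primitives. The stub implementations below are purely illustrative; in the real system `bash` is wired to Harbor's Docker environment and `llm_query`/`llm_query_batched` call the LLM provider.

```python
# Illustrative stubs for the REPL primitives (not the real wiring).
def bash(command):
    return "file_a.py\nfile_b.py"  # stubbed shell output

def llm_query(prompt):
    return f"answer to: {prompt[:30]}"  # stubbed sub-LLM answer

def llm_query_batched(prompts):
    # Batched variant: one answer per prompt, in order.
    return [llm_query(p) for p in prompts]

# The model might adaptively decompose by listing files, then
# querying a sub-LLM about each file in one batched call.
files = bash("ls").splitlines()
summaries = llm_query_batched([f"Summarize {f}" for f in files])
```

The key difference from the handrolled agent is that decomposition like this only happens when the model decides to write it.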
The Canonical RLM executes synchronous Python code in a REPL, while Harbor's environment is async and Docker-based. We bridge this using `nest_asyncio`:

```python
# In the REPL setup_code (nest_asyncio.apply() has already patched the loop):
import asyncio

def bash(command):
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(harbor_env.exec(command))
    return result
```

This allows the model's sync code to call async Harbor functions seamlessly.
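A self-contained version of that sync-over-async pattern is below. Here `fake_exec` is a hypothetical stand-in for `harbor_env.exec`; inside Harbor, where an event loop is already running, you would call `nest_asyncio.apply()` first so the nested `run_until_complete` is allowed.

```python
import asyncio

async def fake_exec(command):
    # Stand-in for harbor_env.exec(command), which is async.
    await asyncio.sleep(0)
    return f"ran: {command}"

def bash(command):
    # Sync wrapper: drive the coroutine to completion on an event loop.
    # Under Harbor a loop is already running, so nest_asyncio.apply()
    # is what makes this nested run_until_complete call legal.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(fake_exec(command))
    finally:
        loop.close()
```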
See RESULTS_SUMMARY.md for detailed analysis.
Preview:
- Simple Agent: 14% success rate (1/7 tasks)
- Handrolled RLM: 0% success rate (0/8 tasks)
- Canonical RLM: Running...
Core implementations:
- `harbor_simple_agent.py` - Simple direct agent
- `harbor_rlm_agent.py` - Forced-decomposition agent
- `harbor_canonical_rlm.py` - Adaptive RLM agent
Standalone experiments:
- `simple_agent.py` - Simple agent (no Harbor)
- `rlm_agent.py` - RLM agent (no Harbor)
- `compare_agents.py` - Synthetic task comparison
Documentation:
- `RESULTS_SUMMARY.md` - Comprehensive results and analysis
- `pyproject.toml` - Dependencies
- Harbor - Agent benchmarking framework
- RLM - Recursive Language Model implementation
- nest_asyncio - Nested event loop support
- X.AI Grok API (or other LiteLLM-compatible provider)
MIT