diff --git a/STATUS.md b/STATUS.md new file mode 100644 index 0000000..2d26acd --- /dev/null +++ b/STATUS.md @@ -0,0 +1,384 @@ +# Coding Agent - Project Status + +**Last Updated:** 2026-02-21 + +## 🎯 Current State: MVP Complete + +The agent is functional and can write, read, and execute code in an isolated sandbox. + +--- + +## βœ… What Works + +### Core Infrastructure +- **Sandbox**: Podman + libkrun (microVM isolation) + - Network disabled + - Workspace mounted at `/workspace` + - Bind mount to `./workspace` on host + - 512MB memory limit + +- **Agent Loop**: Streaming responses with tool visibility + - Shows tool calls as they happen (`πŸ”§ Running: bash(...)`) + - Streams text token-by-token + - Handles tool β†’ response β†’ tool chains + +- **Persistent History**: JSON files in `./history/` + - Format: `YYYYMMDD-HHMMSS.json` + - Includes timestamps + - Auto-saves after each message + +### Tools Available +1. **`bash`** - Execute shell commands in sandbox +2. **`read_file`** - Read file contents from workspace +3. **`write_file`** - Write/create files in workspace (creates parent dirs) + +### Testing +- Unit tests: 13 passing +- Integration tests: 5 passing +- All critical paths covered + +--- + +## ⚠️ Known Issues + +### 1. Output Corruption Bug +**Symptom:** `ls -la` output shows `οΏ½total 40` with leading spaces/corrupt bytes + +**Status:** Investigated, not yet fixed + +**Workaround:** Output is still readable despite corruption + +**Notes:** +- `demux=False` is set but not working as expected +- May be Podman SDK version issue or multiplex header stripping problem +- Planned to let agent debug itself once fully operational + +### 2. 
Multi-line Paste +**Symptom:** Pasting multi-line text causes each line to be processed as separate input + +**Status:** Known limitation of CLI `input()` + +**Workaround:** Don't paste multi-line prompts until TUI is built + +**Solution:** Build Textual TUI (planned) + +--- + +## πŸš€ Quick Start + +### Prerequisites +```bash +# Podman with krun runtime installed +# Python 3.14 +# uv package manager +``` + +### Run the Agent +```bash +# Create workspace +mkdir -p workspace + +# Set up .env +cat > .env << EOF +ANTHROPIC_API_KEY=sk-ant-... +MODEL=claude-sonnet-4-5-20250929 +MAX_TOKENS=8096 +SAFEDIR=./workspace +USE_SANDBOX=true +EOF + +# Start agent +uv run python main.py +``` + +### Commands +- `/quit` - Exit session +- `/clear` - Clear conversation history +- `/help` - Show available commands + +--- + +## πŸ“‚ Project Structure +``` +coding-agent/ +β”œβ”€β”€ main.py # Entry point +β”œβ”€β”€ agent/ +β”‚ β”œβ”€β”€ config.py # Settings (Pydantic) +β”‚ β”œβ”€β”€ loop.py # run_turn, run_session +β”‚ β”œβ”€β”€ history.py # ConversationHistory +β”‚ └── tools.py # TOOL_SCHEMAS, dispatch_tool +β”œβ”€β”€ sandbox/ +β”‚ └── session.py # PodmanSandbox +β”œβ”€β”€ tools/ +β”‚ β”œβ”€β”€ bash.py # bash tool +β”‚ └── files.py # read_file, write_file +β”œβ”€β”€ tests/ +β”‚ β”œβ”€β”€ conftest.py # Shared fixtures +β”‚ β”œβ”€β”€ test_config.py +β”‚ β”œβ”€β”€ test_loop.py +β”‚ β”œβ”€β”€ test_sandbox.py +β”‚ └── test_files.py +β”œβ”€β”€ workspace/ # Agent's workspace (gitignored) +└── history/ # Session history (gitignored) +``` + +--- + +## πŸ”§ Development + +### Run Tests +```bash +# All tests +uv run pytest + +# Unit tests only (fast, no sandbox) +uv run pytest -m unit + +# Integration tests (requires sandbox) +uv run pytest -m integration + +# Specific file +uv run pytest tests/test_files.py -v +``` + +### Add a New Tool + +1. 
**Create tool implementation:** +```python +# tools/my_tool.py +import asyncio + +async def my_tool(param: str, sandbox=None) -> str: + """Tool description.""" + if sandbox is None: + return "Error: No sandbox available" + + try: + result = await asyncio.to_thread(sandbox.run, f"some command {param}") + return result + except Exception as e: + return f"Error: {e}" + +# Tool schema +MY_TOOL_SCHEMA = { + "name": "my_tool", + "description": "What this tool does", + "input_schema": { + "type": "object", + "properties": { + "param": { + "type": "string", + "description": "Parameter description" + } + }, + "required": ["param"] + } +} +``` + +2. **Export from `tools/__init__.py`:** +```python +from tools.my_tool import my_tool, MY_TOOL_SCHEMA + +TOOL_SCHEMAS = [ + BASH_SCHEMA, + READ_FILE_SCHEMA, + WRITE_FILE_SCHEMA, + MY_TOOL_SCHEMA, # Add here +] +``` + +3. **Add to dispatcher:** +```python +# agent/tools.py +from tools import my_tool + +async def dispatch_tool(tool_name: str, tool_input: dict, sandbox=None): + # ... + elif tool_name == "my_tool": + return await my_tool(tool_input["param"], sandbox=sandbox) +``` + +4. **Write tests:** +```python +# tests/test_my_tool.py +@pytest.mark.unit +async def test_my_tool_no_sandbox(): + result = await my_tool("test", sandbox=None) + assert "error" in result.lower() + +@pytest.mark.integration +async def test_my_tool_works(): + async with PodmanSandbox() as sb: + result = await my_tool("test", sb) + assert "expected output" in result +``` + +--- + +## 🎯 Next Steps (Priority Order) + +### Immediate (Make Agent More Useful) +1. **Fix output corruption bug** + - Let agent debug itself with current tools + - Or investigate Podman SDK version/settings + +2. **Add more file tools** (optional enhancements) + - `list_files(directory)` - better than `bash("ls")` + - `search_files(pattern)` - grep with nice output + - `edit_file(filepath, old, new)` - targeted edits + +### Short Term (Better UX) +3. 
**Session resume** + - Add `/load ` command + - ~10 minutes of work + +4. **Build Textual TUI** + - Multi-line input support + - Better history viewing + - Collapsible tool output + - ~3-4 hours + +### Medium Term (Collaboration Features) +5. **Git integration (host-side tools)** + - `git_clone(repo)` - uses your SSH keys + - `git_push(branch)` - uses your credentials + - `create_pr(title, body)` - uses GitHub/Gitea API + - Agent works on feature branches, you review PRs + - ~2-3 hours + +6. **Improve error messages** + - Better tool error reporting + - Exit codes visible to agent + - ~1 hour + +### Long Term (Advanced Features) +7. **Web API interface** + - FastAPI + SSE for streaming + - Multi-user support (separate sandboxes) + - ~4-6 hours + +8. **Custom base image** + - Pre-install common packages + - Faster startup + - ~1-2 hours + +9. **Tool call optimization** + - Batch related operations + - Cache frequent commands + - ~2-3 hours + +--- + +## πŸ§ͺ Testing the Agent + +### Simple Task +``` +You: Create a Python script that prints "Hello, World!" +``` + +Expected: Agent writes file, shows content, runs it, shows output. + +### Medium Task +``` +You: Create a Flask API with /health endpoint that returns {"status": "ok"} + Include requirements.txt +``` + +Expected: Agent writes app.py, requirements.txt, installs flask, tests the endpoint. + +### Complex Task +``` +You: Create a data processing script that: + 1. Reads a CSV file + 2. Filters rows where value > 100 + 3. Saves to new CSV + + Include sample data and tests +``` + +Expected: Agent writes script, creates sample data, writes tests, runs everything. + +--- + +## πŸ“ Notes + +### Why Podman + krun? +- **VM-level isolation** (not just containers) +- **Daemonless** (no background service) +- **Rootless** by default +- **Docker-compatible** API +- Fast startup (~125ms) + +### Why Not Docker? 
+- Container isolation only (not VM) +- Requires daemon +- Podman is drop-in replacement with better security + +### Why Not microsandbox? +- Promising but immature (SDK version mismatches) +- Podman + krun uses same underlying tech (libkrun) +- More stable ecosystem +- Can revisit microsandbox in 6-12 months + +### Sandbox Security Model +- **Network disabled** - agent can't exfiltrate data +- **Workspace mount** - only way to persist files +- **Ephemeral VM** - destroyed after session +- **Host git** - credentials never in sandbox +- Agent works on feature branches, you review PRs + +### Design Decisions +- **Streaming vs batched** - Streaming for better UX +- **One tool per file** - Clear organization, easy to find +- **Schemas with tools** - Keep related code together +- **Keyword args for sandbox** - More maintainable +- **JSON history** - Human-readable, git-friendly +- **Async throughout** - Future-proof for web API + +--- + +## 🀝 Contributing (Future) + +When ready to open-source: +1. Add proper README +2. Add LICENSE (MIT recommended) +3. Add CONTRIBUTING.md +4. Set up CI/CD (GitHub Actions) +5. Add pre-commit hooks +6. 
Document MCP integration path + +--- + +## πŸ“š Key Learnings + +### What Worked Well +- **Layered architecture** - Easy to add features on top +- **Testing from the start** - Caught issues early +- **Simple tools first** - bash/read/write covers 90% of needs +- **Integration tests** - More valuable than complex unit tests + +### What Was Hard +- **Async/sync boundaries** - `asyncio.to_thread` for podman SDK +- **Streaming API** - Required rewriting entire request flow +- **Mock complexity** - Some unit tests not worth the effort +- **Version mismatches** - microsandbox SDK vs server + +### Surprises +- **Podman multiplex headers** - Unexpected output corruption +- **Multi-line paste** - CLI input() limitation +- **Test refactoring** - Changing streaming broke all tests +- **Path validation** - More edge cases than expected + +--- + +## πŸ”— Useful Links + +- **Anthropic API Docs**: https://docs.anthropic.com +- **Podman Python SDK**: https://podman-py.readthedocs.io +- **Textual TUI**: https://textual.textualize.io +- **Pydantic**: https://docs.pydantic.dev + +--- + +*This is a working MVP. The agent can write, read, and execute code safely. Everything else is enhancement.*
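
The Known Issues notes name multiplex header stripping as one possible cause of the output corruption. The sketch below is one way to test that hypothesis on captured raw bytes. It assumes the sandbox output uses the Docker-compatible framing (an 8-byte header per frame: one stream-id byte, three zero bytes, then a big-endian 32-bit payload length); whether the Podman SDK actually returns bytes in this form here is not confirmed by this document. `demux_stream` and `make_frame` are hypothetical helper names, not part of the project.

```python
import struct

# Assumed frame header: stream id (1 byte), three zero bytes,
# payload length (unsigned 32-bit, big-endian). Total: 8 bytes.
HEADER = struct.Struct(">B3sI")


def demux_stream(raw: bytes) -> tuple[bytes, bytes]:
    """Split framed output into (stdout, stderr).

    Bytes that do not look like a frame header are passed through
    unchanged as stdout, so unframed output is left intact.
    """
    stdout, stderr = bytearray(), bytearray()
    offset = 0
    while offset < len(raw):
        # Too short to hold a header: treat the remainder as plain output.
        if len(raw) - offset < HEADER.size:
            stdout += raw[offset:]
            break
        stream_id, padding, length = HEADER.unpack_from(raw, offset)
        # Stream ids are 0 (stdin), 1 (stdout), 2 (stderr) in the assumed format.
        if stream_id > 2 or padding != b"\x00\x00\x00":
            stdout += raw[offset:]
            break
        start = offset + HEADER.size
        payload = raw[start:start + length]
        # Only stream id 2 is routed to stderr; ids 0 and 1 go to stdout.
        (stderr if stream_id == 2 else stdout).extend(payload)
        offset = start + length
    return bytes(stdout), bytes(stderr)


def make_frame(stream_id: int, payload: bytes) -> bytes:
    """Build one frame in the assumed format (for demonstration only)."""
    return HEADER.pack(stream_id, b"\x00\x00\x00", len(payload)) + payload


if __name__ == "__main__":
    framed = make_frame(1, b"total 40\n") + make_frame(2, b"warning\n")
    out, err = demux_stream(framed)
    print(out.decode(), err.decode())
```

If the corruption has a different cause, the helper passes the bytes through unchanged, which by itself helps rule the header hypothesis in or out.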