secure-agent/STATUS.md

# Coding Agent - Project Status

**Last Updated:** 2026-02-21

## 🎯 Current State: MVP Complete

The agent is functional and can write, read, and execute code in an isolated sandbox.

---

## ✅ What Works

### Core Infrastructure
- **Sandbox**: Podman + libkrun (microVM isolation)
  - Network disabled
  - Workspace mounted at `/workspace`
  - Bind mount to `./workspace` on host
  - 512MB memory limit

- **Agent Loop**: Streaming responses with tool visibility
  - Shows tool calls as they happen (`🔧 Running: bash(...)`)
  - Streams text token-by-token
  - Handles tool → response → tool chains

- **Persistent History**: JSON files in `./history/`
  - Format: `YYYYMMDD-HHMMSS.json`
  - Includes timestamps
  - Auto-saves after each message
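The persistence behaviour listed above (timestamped file name, per-message timestamps, auto-save) could look roughly like the sketch below. `HistorySketch` and its field names are hypothetical stand-ins, not the project's actual `ConversationHistory` class.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class HistorySketch:
    """Hypothetical stand-in for ConversationHistory (names are assumptions)."""

    def __init__(self, directory: Path, started: Optional[datetime] = None) -> None:
        started = started or datetime.now()
        directory.mkdir(parents=True, exist_ok=True)
        # File name follows the documented YYYYMMDD-HHMMSS.json format.
        self.path = directory / f"{started:%Y%m%d-%H%M%S}.json"
        self.messages: list = []

    def append(self, role: str, content: str) -> None:
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        self._save()  # auto-save after each message

    def _save(self) -> None:
        # Write to a temporary file, then rename, so a crash mid-write
        # cannot leave a truncated session file behind.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps(self.messages, indent=2), encoding="utf-8")
        tmp.replace(self.path)
```

Usage: `HistorySketch(Path("./history")).append("user", "hello")` creates the session file on the first message.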

### Tools Available
1. **`bash`** - Execute shell commands in sandbox
2. **`read_file`** - Read file contents from workspace
3. **`write_file`** - Write/create files in workspace (creates parent dirs)

### Testing
- Unit tests: 13 passing
- Integration tests: 5 passing
- All critical paths covered

---

## ⚠️ Known Issues

### 1. Output Corruption Bug
**Symptom:** `ls -la` output shows `�total 40` with leading spaces/corrupt bytes

**Status:** Investigated, not yet fixed

**Workaround:** None needed; output is still readable despite the corrupt leading bytes

**Notes:**
- `demux=False` is set but does not behave as expected
- May be a Podman SDK version issue or a problem stripping the stream multiplex headers
- Plan: let the agent debug this itself once it is fully operational
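
If the stray bytes are unstripped multiplex headers, a manual demultiplexer is one way to test that hypothesis. The sketch below assumes the Docker-compatible framing (an 8-byte header per frame: stream id, three padding bytes, big-endian payload length); whether the Podman SDK returns this framing in the failing case has not been verified, and `strip_multiplex_headers` is a hypothetical helper, not existing project code.

```python
import struct

# Stream id (1 = stdout, 2 = stderr), 3 padding bytes, big-endian payload length.
HEADER = struct.Struct(">BxxxI")


def strip_multiplex_headers(raw: bytes) -> tuple:
    """Split a multiplexed byte stream into (stdout, stderr)."""
    stdout, stderr = bytearray(), bytearray()
    offset = 0
    while offset + HEADER.size <= len(raw):
        stream_id, length = HEADER.unpack_from(raw, offset)
        offset += HEADER.size
        payload = raw[offset:offset + length]
        offset += length
        # Anything that is not stderr is treated as stdout in this sketch.
        (stderr if stream_id == 2 else stdout).extend(payload)
    return bytes(stdout), bytes(stderr)
```

Feeding the raw bytes of a failing `ls -la` call through this function and comparing the result with the corrupted output would confirm or rule out the header theory.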

### 2. Multi-line Paste
**Symptom:** Pasting multi-line text causes each line to be processed as separate input

**Status:** Known limitation of CLI `input()`

**Workaround:** Don't paste multi-line prompts until the TUI is built

**Solution:** Build Textual TUI (planned)

---

## 🚀 Quick Start

### Prerequisites
```bash
# Podman with krun runtime installed
# Python 3.14
# uv package manager
```

### Run the Agent
```bash
# Create workspace
mkdir -p workspace

# Set up .env
cat > .env << EOF
ANTHROPIC_API_KEY=sk-ant-...
MODEL=claude-sonnet-4-5-20250929
MAX_TOKENS=8096
SAFEDIR=./workspace
USE_SANDBOX=true
EOF

# Start agent
uv run python main.py
```

### Commands
- `/quit` - Exit session
- `/clear` - Clear conversation history
- `/help` - Show available commands
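
A minimal sketch of how these slash commands might be dispatched before a line is sent to the model. `handle_command` and its return values are assumptions for illustration; the real handling lives in the session loop and may differ.

```python
from typing import Optional

HELP_TEXT = (
    "/quit - Exit session\n"
    "/clear - Clear conversation history\n"
    "/help - Show available commands"
)


def handle_command(line: str, history: list) -> Optional[str]:
    """Return "quit", "handled", or None when the line is a normal prompt."""
    command = line.strip().lower()
    if not command.startswith("/"):
        return None  # not a command; pass the line to the agent
    if command == "/quit":
        return "quit"
    if command == "/clear":
        history.clear()
        return "handled"
    if command == "/help":
        print(HELP_TEXT)
        return "handled"
    print(f"Unknown command: {command}")
    return "handled"
```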

---

## 📂 Project Structure
```
coding-agent/
├── main.py                  # Entry point
├── agent/
│   ├── config.py           # Settings (Pydantic)
│   ├── loop.py             # run_turn, run_session
│   ├── history.py          # ConversationHistory
│   └── tools.py            # TOOL_SCHEMAS, dispatch_tool
├── sandbox/
│   └── session.py          # PodmanSandbox
├── tools/
│   ├── bash.py             # bash tool
│   └── files.py            # read_file, write_file
├── tests/
│   ├── conftest.py         # Shared fixtures
│   ├── test_config.py
│   ├── test_loop.py
│   ├── test_sandbox.py
│   └── test_files.py
├── workspace/              # Agent's workspace (gitignored)
└── history/                # Session history (gitignored)
```

---

## 🔧 Development

### Run Tests
```bash
# All tests
uv run pytest

# Unit tests only (fast, no sandbox)
uv run pytest -m unit

# Integration tests (requires sandbox)
uv run pytest -m integration

# Specific file
uv run pytest tests/test_files.py -v
```

### Add a New Tool

1. **Create tool implementation:**
```python
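# tools/my_tool.py
import asyncio
import shlex

async def my_tool(param: str, sandbox=None) -> str:
    """Tool description."""
    if sandbox is None:
        return "Error: No sandbox available"

    try:
        # Quote the parameter so it reaches the shell as a single argument
        command = f"some command {shlex.quote(param)}"
        # sandbox.run is blocking, so run it in a worker thread
        result = await asyncio.to_thread(sandbox.run, command)
        return result
    except Exception as e:
        return f"Error: {e}"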

# Tool schema
MY_TOOL_SCHEMA = {
    "name": "my_tool",
    "description": "What this tool does",
    "input_schema": {
        "type": "object",
        "properties": {
            "param": {
                "type": "string",
                "description": "Parameter description"
            }
        },
        "required": ["param"]
    }
}
```

2. **Export from `tools/__init__.py`:**
```python
from tools.my_tool import my_tool, MY_TOOL_SCHEMA

TOOL_SCHEMAS = [
    BASH_SCHEMA,
    READ_FILE_SCHEMA,
    WRITE_FILE_SCHEMA,
    MY_TOOL_SCHEMA,  # Add here
]
```

3. **Add to dispatcher:**
```python
# agent/tools.py
from tools import my_tool

async def dispatch_tool(tool_name: str, tool_input: dict, sandbox=None):
    # ... existing if/elif branches for bash, read_file, write_file ...
    elif tool_name == "my_tool":
        return await my_tool(tool_input["param"], sandbox=sandbox)
```

4. **Write tests:**
```python
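# tests/test_my_tool.py
# Async tests assume the project's pytest async plugin is already configured.
import pytest

from sandbox.session import PodmanSandbox
from tools.my_tool import my_tool

@pytest.mark.unit
async def test_my_tool_no_sandbox():
    result = await my_tool("test", sandbox=None)
    assert "error" in result.lower()

@pytest.mark.integration
async def test_my_tool_works():
    async with PodmanSandbox() as sb:
        # Pass the sandbox by keyword, matching the dispatcher
        result = await my_tool("test", sandbox=sb)
        assert "expected output" in result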

---

## 🎯 Next Steps (Priority Order)

### Immediate (Make Agent More Useful)
1. **Fix output corruption bug**
   - Let agent debug itself with current tools
   - Or investigate Podman SDK version/settings

2. **Add more file tools** (optional enhancements)
   - `list_files(directory)` - better than `bash("ls")`
   - `search_files(pattern)` - grep with nice output
   - `edit_file(filepath, old, new)` - targeted edits

### Short Term (Better UX)
3. **Session resume**
   - Add `/load <session-id>` command
   - ~10 minutes of work

4. **Build Textual TUI**
   - Multi-line input support
   - Better history viewing
   - Collapsible tool output
   - ~3-4 hours

### Medium Term (Collaboration Features)
5. **Git integration (host-side tools)**
   - `git_clone(repo)` - uses your SSH keys
   - `git_push(branch)` - uses your credentials
   - `create_pr(title, body)` - uses GitHub/Gitea API
   - Agent works on feature branches, you review PRs
   - ~2-3 hours

6. **Improve error messages**
   - Better tool error reporting
   - Exit codes visible to agent
   - ~1 hour
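
One possible shape for making exit codes visible to the agent is to prefix every tool result with the code and label the two output streams. `format_tool_result` is a hypothetical helper sketched for this roadmap item, not existing code.

```python
def format_tool_result(exit_code: int, stdout: str, stderr: str) -> str:
    """Render a command result so the model can see success or failure."""
    parts = [f"exit_code: {exit_code}"]
    if stdout:
        parts.append(f"stdout:\n{stdout.rstrip()}")
    if stderr:
        parts.append(f"stderr:\n{stderr.rstrip()}")
    return "\n".join(parts)
```

A plain-text layout keeps the result easy to read in the CLI; a JSON object would work equally well if the agent needs to parse it.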

### Long Term (Advanced Features)
7. **Web API interface**
   - FastAPI + SSE for streaming
   - Multi-user support (separate sandboxes)
   - ~4-6 hours

8. **Custom base image**
   - Pre-install common packages
   - Faster startup
   - ~1-2 hours

9. **Tool call optimization**
   - Batch related operations
   - Cache frequent commands
   - ~2-3 hours

---

## 🧪 Testing the Agent

### Simple Task
```
You: Create a Python script that prints "Hello, World!"
```

Expected: Agent writes file, shows content, runs it, shows output.

### Medium Task
```
You: Create a Flask API with /health endpoint that returns {"status": "ok"}
     Include requirements.txt
```

Expected: Agent writes app.py, requirements.txt, installs flask, tests the endpoint.

### Complex Task
```
You: Create a data processing script that:
     1. Reads a CSV file
     2. Filters rows where value > 100
     3. Saves to new CSV

     Include sample data and tests
```

Expected: Agent writes script, creates sample data, writes tests, runs everything.

---

## 📝 Notes

### Why Podman + krun?
- **VM-level isolation** (not just containers)
- **Daemonless** (no background service)
- **Rootless** by default
- **Docker-compatible** API
- Fast startup (~125ms)

### Why Not Docker?
- Container isolation only (not VM)
- Requires daemon
- Podman is a drop-in replacement with better security

### Why Not microsandbox?
- Promising but immature (SDK version mismatches)
- Podman + krun uses same underlying tech (libkrun)
- More stable ecosystem
- Can revisit microsandbox in 6-12 months

### Sandbox Security Model
- **Network disabled** - agent can't exfiltrate data
- **Workspace mount** - only way to persist files
- **Ephemeral VM** - destroyed after session
- **Host git** - credentials never in sandbox
- Agent works on feature branches; you review PRs
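
For reference, the sandbox settings described in this document (krun runtime, no network, 512MB memory limit, workspace bind mount) map onto `podman run` flags roughly as sketched below. The project itself goes through the Podman Python SDK, so this CLI form is only an illustration; `build_podman_args` and the image name are hypothetical.

```python
from pathlib import Path


def build_podman_args(workspace: Path, image: str, command: list) -> list:
    """Build a `podman run` argv mirroring the documented sandbox settings."""
    return [
        "podman", "run", "--rm",                # ephemeral: removed after exit
        "--runtime", "krun",                    # microVM isolation via libkrun
        "--network", "none",                    # network disabled
        "--memory", "512m",                     # 512MB memory limit
        "--volume", f"{workspace.resolve()}:/workspace",  # host bind mount
        "--workdir", "/workspace",
        image, *command,
    ]


if __name__ == "__main__":
    # "example-image" is a placeholder, not the project's base image.
    print(" ".join(build_podman_args(Path("./workspace"), "example-image", ["ls", "-la"])))
```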

### Design Decisions
- **Streaming vs batched** - Streaming for better UX
- **One tool per file** - Clear organization, easy to find
- **Schemas with tools** - Keep related code together
- **Keyword args for sandbox** - More maintainable
- **JSON history** - Human-readable, git-friendly
- **Async throughout** - Future-proof for web API

---

## 🤝 Contributing (Future)

When ready to open-source:
1. Add proper README
2. Add LICENSE (MIT recommended)
3. Add CONTRIBUTING.md
4. Set up CI/CD (GitHub Actions)
5. Add pre-commit hooks
6. Document MCP integration path

---

## 📚 Key Learnings

### What Worked Well
- **Layered architecture** - Easy to add features on top
- **Testing from the start** - Caught issues early
- **Simple tools first** - bash/read/write covers 90% of needs
- **Integration tests** - More valuable than complex unit tests

### What Was Hard
- **Async/sync boundaries** - Blocking Podman SDK calls are wrapped in `asyncio.to_thread`
- **Streaming API** - Required rewriting entire request flow
- **Mock complexity** - Some unit tests not worth the effort
- **Version mismatches** - microsandbox SDK vs server

### Surprises
- **Podman multiplex headers** - Unexpected output corruption
- **Multi-line paste** - CLI `input()` limitation
- **Test refactoring** - Changing streaming broke all tests
- **Path validation** - More edge cases than expected
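
To illustrate the path-validation edge cases mentioned above, here is a minimal sketch of confining a user-supplied path to the workspace. `resolve_in_workspace` is a hypothetical helper, not the project's implementation; it assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, user_path: str) -> Path:
    """Resolve user_path under workspace, rejecting anything that escapes it."""
    root = workspace.resolve()
    # Joining with an absolute path discards root, and resolve() collapses
    # ".." segments and follows symlinks, so one check covers all three cases.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"Path escapes workspace: {user_path}")
    return candidate
```

Inputs such as `../secret`, `/etc/passwd`, and `a/../../b` all raise `ValueError`, while `src/app.py` resolves inside the workspace.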

---

## 🔗 Useful Links

- **Anthropic API Docs**: https://docs.anthropic.com
- **Podman Python SDK**: https://podman-py.readthedocs.io
- **Textual TUI**: https://textual.textualize.io
- **Pydantic**: https://docs.pydantic.dev

---

*This is a working MVP. The agent can write, read, and execute code safely. Everything else is enhancement.*