Replacing Several Python Containers with a Single Go Binary
Why I replaced homelab-ops, the repair agent, the management console, and the Claude SDK agent runner with one self-contained Go service.
The Container Sprawl Problem
At some point I counted the Python containers running in my homelab just for internal tooling — not media servers, not monitoring, not Home Assistant, just my own custom code — and the number was higher than I wanted to admit.
Each one followed the same pattern: a Dockerfile that installed Python, copied in a requirements.txt, ran pip install, and started some process. Each one needed a base image to keep updated. Each one had its own virtual environment baked into the image layer. And because they were all small focused tools, none of them were complicated enough to justify the overhead.
The main offenders were homelab-ops, the repair agent, the agent command center, and the terminal agent. The solution was to replace the coordination layer — the part that all the others talked to — with a single Go binary called platform.
What the Tools Did
homelab-ops was a FastAPI service that handled GitOps for the homelab. Gitea would fire a webhook on every push to main, homelab-ops would look at which files changed, and trigger the appropriate Ansible playbook or Terraform module. It had a web dashboard showing job history, a locking system to prevent concurrent deployments, and a handful of API endpoints you could call manually. About 600 lines of Python across multiple files, plus Jinja2 templates for the dashboard.
repair-agent was the AI-powered self-repair service I wrote about a few weeks ago. When Uptime Kuma detected an outage, it would fire a webhook through n8n to the repair agent, which asked Ollama to pick an action from a whitelisted set of safe commands — restart a container, check disk space, pull logs — then SSH'd into the appropriate homelab machine to execute it. It logged everything to an audit trail and pushed notifications back to Home Assistant. FastAPI, asyncssh, httpx.
agent-command-center (ACC) was the management console for homelab AI services. It maintained a registry of running agents, tracked a task queue in SQLite, routed submitted tasks to the best available agent based on capabilities and load, and served a Vue frontend for visibility into what was running. It also had a background loop that synced task state from claude-memory, which is how the Claude SDK agent picked up work. Another FastAPI service — and critically, the one that every other service in this list talked to.
terminal-agent is a FastAPI service that executes tasks by calling Claude through the Agent SDK, with MCP tools attached for homelab and n8n access. It registers with the ACC on startup, pulls tasks from the queue, runs them as Claude SDK sessions, and streams logs over SSE. It's the thing that actually does work when you submit a prompt to the management console.
Four containers, four Python runtimes, four requirements.txt files. But the bigger issue was the ACC itself: it had grown into a tangle of async database calls, memory sync loops, and middleware that was harder to follow than it needed to be. It was time to replace the coordination layer with something cleaner.
Why Go
I picked up Go basics a couple of years ago and wanted to get back to it with a real project — something with enough surface area to actually learn from, not just a toy script. The homelab coordination layer was a good fit. The reasons to use Go here were also mostly practical.
Go produces a single statically-linked binary. You build it once, copy it anywhere, and it runs. No interpreter, no virtualenv, no system Python version to manage. The binary for the platform is under 20MB including the embedded web UI.
The standard library HTTP server is production-quality. I wasn't going to miss FastAPI — net/http plus chi for routing covers everything I need, and the result is a server that starts in milliseconds and uses a fraction of the memory.
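For a sense of what that looks like, here is a minimal sketch of a chi router with a health endpoint. The routes and handler bodies are illustrative, not the platform's actual code:

package main

import (
	"log"
	"net/http"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
)

func main() {
	r := chi.NewRouter()
	r.Use(middleware.Logger)

	// Liveness endpoint used by container health checks.
	r.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.Write([]byte("ok"))
	})

	// Versioned API subrouter; the real handlers hang off here.
	r.Route("/api/v1", func(r chi.Router) {
		r.Post("/agents/register", func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusCreated) // placeholder
		})
	})

	log.Fatal(http.ListenAndServe(":8000", r))
}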
Goroutines made the background health loop trivial. In the Python version of homelab-ops I used FastAPI's lifespan and asyncio for background tasks. In Go it's just go reg.RunHealthLoop(ctx) in main — a goroutine that ticks every 30 seconds, marks agents offline if their heartbeat lapses, and exits cleanly when the context is cancelled.
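A sketch of what that loop can look like, assuming a mutex-guarded registry (the type names and the 90-second offline threshold are mine, not the platform's actual code):

package platform

import (
	"context"
	"sync"
	"time"
)

type Agent struct {
	Status        string
	LastHeartbeat time.Time
}

type Registry struct {
	mu     sync.Mutex
	agents map[string]*Agent
}

// RunHealthLoop ticks every 30 seconds, marks agents offline when their
// last heartbeat is older than the cutoff, and returns when the context
// passed from main is cancelled.
func (r *Registry) RunHealthLoop(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cutoff := time.Now().Add(-90 * time.Second)
			r.mu.Lock()
			for _, a := range r.agents {
				if a.LastHeartbeat.Before(cutoff) {
					a.Status = "offline"
				}
			}
			r.mu.Unlock()
		}
	}
}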
And critically: Go can embed static files directly into the binary at build time, which meant I could ship the web UI without a separate deployment step.
What the Platform Does
The platform is the Go replacement for the agent-command-center. Its job is the same — track which agents are online, route tasks to them, provide a management UI — but without the async complexity the Python version had accumulated. The terminal agent registers on startup exactly as it did before:
POST /api/v1/agents/register
{
  "agent_id": "terminal-agent-core",
  "host": "homelab-core.local",
  "port": 8350,
  "capabilities": ["claude", "homelab-mcp", "n8n-mcp"],
  "mcp_servers": ["mcp.local.brianrogers.dev"]
}
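On the Go side, that payload decodes into a plain struct. This is a sketch: the field names follow the JSON above, the type name is mine:

package platform

// RegisterRequest mirrors the body of POST /api/v1/agents/register.
type RegisterRequest struct {
	AgentID      string   `json:"agent_id"`
	Host         string   `json:"host"`
	Port         int      `json:"port"`
	Capabilities []string `json:"capabilities"`
	MCPServers   []string `json:"mcp_servers"`
}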
After registering, agents send heartbeats every 30 seconds with CPU/memory load and the count of active tasks. The platform uses this to maintain a real-time view of what's running and to route new tasks to the best-suited agent:
// SelectAgent picks the best available agent using weighted scoring.
// Weights: active tasks (0.4), cpu+mem load (0.2)
func score(a *Agent) float64 {
	taskScore := 1.0 - clamp(float64(a.ActiveTasks)/10.0, 0, 1)
	loadScore := 1.0 - clamp((a.CPUPercent+a.MemoryPercent)/200.0, 0, 1)
	return taskScore*0.4 + loadScore*0.2
}
Tasks submitted to the platform are persisted in SQLite, assigned to an agent, and tracked through a lifecycle of queued → assigned → running → completed (or failed, with configurable retries). The web UI proxies through to the agent's own task endpoints, so you can see logs and results without having to know which host the agent is on.
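A sketch of how those states can be modelled: the status names mirror the lifecycle above, while the transition table is my own illustration rather than the platform's actual code.

package platform

import "fmt"

type TaskStatus string

const (
	StatusQueued    TaskStatus = "queued"
	StatusAssigned  TaskStatus = "assigned"
	StatusRunning   TaskStatus = "running"
	StatusCompleted TaskStatus = "completed"
	StatusFailed    TaskStatus = "failed"
)

// validNext is one way to encode the allowed transitions so a stray
// status update from an agent can't move a task backwards. Failed tasks
// go back to queued, which is where a retry policy would hook in.
var validNext = map[TaskStatus][]TaskStatus{
	StatusQueued:   {StatusAssigned},
	StatusAssigned: {StatusRunning, StatusFailed},
	StatusRunning:  {StatusCompleted, StatusFailed},
	StatusFailed:   {StatusQueued},
}

func (s TaskStatus) CanTransitionTo(next TaskStatus) error {
	for _, allowed := range validNext[s] {
		if allowed == next {
			return nil
		}
	}
	return fmt.Errorf("invalid task transition: %s -> %s", s, next)
}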
One Binary, Two Interfaces
The binary is the server. Environment variables configure it:
ADDR=:8000 # listen address
DB_PATH=acc.db # SQLite database
API_KEY=secret # bearer token for auth (optional)
The "CLI" side is the accclient Go package — a client library any service can use to register as an agent, send heartbeats, and interact with the task queue. The terminal agent already had a Python acc_client.py that did exactly this against the old FastAPI ACC. Swapping the endpoint to point at the Go platform required no changes on the terminal agent side — the API contract is the same.
The repair agent works the same way: it registers, listens for repair tasks routed to it by the platform, executes them via SSH, and posts results back. Instead of being a standalone FastAPI service that n8n calls directly, it becomes a proper registered agent. The platform handles routing, visibility, and retry logic — the repair agent just needs to do its job.
The Embedded Web UI
The Python ACC served its frontend by mounting a frontend/dist/ directory as a static files route — which meant you had to build the Vue app separately, keep it in the right place, and make sure the path was mounted correctly in Docker. The Go platform does this better. The web UI is a Vue app compiled to dist/ at build time, then embedded directly into the binary:
// web/embed.go
package web

import "embed"

//go:embed dist
var Dist embed.FS
The router serves the static files from the embedded FS, and injects the API key into the HTML at request time so the frontend can authenticate without baking credentials into the build:
html := strings.Replace(
	string(rawHTML),
	"</head>",
	`<meta name="acc-api-key" content="`+apiKey+`"></head>`,
	1,
)
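Wiring the embedded FS into the router is the other half. A minimal sketch, assuming everything the Vue build produces lives under dist/ inside the embedded tree:

package web

import (
	"io/fs"
	"net/http"
)

// Handler serves the compiled Vue app out of the embedded dist/ tree.
// Sketch only: the real router also rewrites index.html on the way out
// to inject the API key, as shown above.
func Handler() http.Handler {
	// Strip the leading "dist" so /assets/... resolves inside the embedded tree.
	sub, err := fs.Sub(Dist, "dist")
	if err != nil {
		panic(err)
	}
	return http.FileServer(http.FS(sub))
}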
The result: one binary that serves its own management UI. No nginx, no separate static file hosting, no "remember to deploy the frontend too." scp platform homelab-core:/usr/local/bin/ and it's done.
SQLite Was the Right Call
homelab-ops used SQLite too, but via Python's aiosqlite. The Go version uses modernc.org/sqlite, which is a pure-Go SQLite port with no CGO dependency. This matters because CGO complicates cross-compilation, and I build for Linux/amd64 from my Mac. With a pure-Go SQLite driver, GOOS=linux GOARCH=amd64 go build just works.
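Opening the database looks the same as with any other database/sql driver; only the import path and the registered driver name differ. The schema below is just a stand-in, not the platform's real tables:

package main

import (
	"database/sql"
	"log"

	_ "modernc.org/sqlite" // pure-Go driver; registers itself as "sqlite"
)

func main() {
	db, err := sql.Open("sqlite", "acc.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stand-in schema for illustration.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS tasks (
		id         TEXT PRIMARY KEY,
		status     TEXT NOT NULL,
		created_at TEXT NOT NULL
	)`)
	if err != nil {
		log.Fatal(err)
	}
}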
For the scale of a homelab — a few agents, a few hundred tasks in the history — SQLite is more than enough. The database file is a single acc.db that I can inspect with any SQLite browser, back up with cp, and version with whatever retention policy I want. It's the right level of complexity for the problem.
Before and After
The before state: four services in core/docker-compose.yml, each with its own build context, image, environment block, and health check. Each one pulling a Python base image, installing dependencies, mounting config files.
homelab-ops:
  build: ./services/homelab-ops
  restart: unless-stopped
  environment:
    - ANSIBLE_INVENTORY=/ansible/inventory.yml
  ...

repair-agent:
  build: ./services/repair-agent
  restart: unless-stopped
  environment:
    - OLLAMA_URL=http://homelab-ai.local:11434
    - SSH_KEY_PATH=/app/ssh/id_rsa
  ...

agent-command-center:
  build: ./services/agent-command-center
  restart: unless-stopped
  volumes:
    - ./services/agent-command-center/frontend/dist:/app/frontend/dist
  ...

terminal-agent:
  build: ./services/terminal-agent
  restart: unless-stopped
  ...
The after state: one multi-stage Docker build that produces a minimal Alpine image, deployed to k3s.
FROM node:24-alpine AS ui
WORKDIR /web
COPY web/package*.json ./
RUN npm ci
COPY web/ ./
RUN npm run build

FROM golang:1.25-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
COPY --from=ui /web/dist ./web/dist
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /out/platform ./cmd/platform

FROM alpine:3.21
RUN apk add --no-cache ca-certificates
COPY --from=build /out/platform /platform
EXPOSE 8000
USER nobody
ENTRYPOINT ["/platform"]
The final image is Alpine plus a single static binary. No Python runtime, no pip layers, no virtualenv. And CGO_ENABLED=0 isn't just a convenience: because the SQLite driver is pure Go, the binary cross-compiles cleanly for amd64, arm64, and armv6 (the Raspberry Pi) from the same source tree, with no toolchain gymnastics.
For contexts where Docker isn't in the picture — provisioning a new machine, running on the Pi — the Makefile's cross target produces platform-specific binaries you can just scp and run. One binary, no installation, starts in under a second.
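Roughly what that target looks like; this is a sketch, since the output names and exact flags are mine rather than copied from the real Makefile:

# Sketch of a cross-compilation target; output names are illustrative.
cross:
	CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags="-s -w" -o dist/platform-linux-amd64 ./cmd/platform
	CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -ldflags="-s -w" -o dist/platform-linux-arm64 ./cmd/platform
	CGO_ENABLED=0 GOOS=linux GOARCH=arm GOARM=6 go build -ldflags="-s -w" -o dist/platform-linux-armv6 ./cmd/platform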
What I Gave Up
The Python ACC had a background memory sync loop that polled claude-memory every 15 seconds and synced task status back into the local SQLite database. It was how the dispatcher could update task state asynchronously. The Go platform doesn't have this yet — task state is updated synchronously via the API. Losing the bidirectional memory sync is the biggest functional gap right now.
The homelab-ops Gitea webhook handler also hasn't been ported yet. The platform routes tasks to agents, but the webhook logic that maps changed file paths to specific Ansible playbooks still lives in the old Python service. That's the next thing to migrate.
Lessons
The rewrite forced me to think more carefully about what the coordination layer actually needed to do. The Python ACC had grown organically — a memory sync loop added here, import endpoints added there, middleware stacked on top of middleware. Starting from scratch in Go meant making deliberate choices about the data model and API surface before writing any handlers. The Go version has fewer endpoints and does less, but what it does is clearer.
The embedded UI pattern is underrated. I'd been shipping web UIs as separate deployments for so long that it didn't occur to me to just include them in the binary. For internal tools where you control the build pipeline, it's a significant reduction in operational complexity.
And the boring conclusion: fewer moving parts is better. Four containers, each with their own build context and base image, meant four things that could break and four sets of logs to dig through. One platform binary with structured JSON logs and a single /health endpoint is a lot easier to operate.
What's Next
The Gitea webhook handler is the obvious gap — I want the platform to receive pushes and route them to an ansible-runner agent rather than keeping homelab-ops alive just for that. After that: porting the memory sync loop so the terminal agent can update task state asynchronously the way it did before, and wiring the platform's /metrics endpoint into the existing Grafana dashboards. The endpoint already exists; it just needs a scrape target added to the Prometheus config.
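For reference, that's only a few lines of prometheus.yml; the job name and target host here are placeholders for my setup:

scrape_configs:
  - job_name: "platform"
    metrics_path: /metrics
    static_configs:
      - targets: ["homelab-core.local:8000"]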
One binary, one image, one place to look when something breaks. Worth it.