Building a Self-Healing Homelab with AI

How I built an autonomous repair system that monitors my homelab services and automatically fixes them when they go down.

The Problem

Running a homelab is great until something breaks at 2 AM. Whether it's a container that crashed, a service that ran out of memory, or just the usual Docker gremlins, I was tired of waking up to notifications about downed services and having to manually SSH in to restart things.

What if my homelab could fix itself?

The Solution: AI-Powered Self-Healing

I built a pipeline that combines monitoring, AI analysis, and automated repair actions:

Service Down → Uptime Kuma → n8n → Ollama → Repair Agent → Service Restored

Here's how each piece works:

1. Uptime Kuma - The Watchdog

Uptime Kuma monitors all my services - Grafana, Prometheus, Home Assistant, Jellyfin, and more. When something goes down, it fires a webhook to n8n.
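
To make the hand-off concrete, here is a minimal sketch of how a receiver might interpret that webhook body. It assumes the JSON shape of Uptime Kuma's webhook notification (a `heartbeat` object with a numeric `status`, a `monitor` object with a `name`, and a top-level `msg`). The `Alert` type and `parse_alert` function are hypothetical names for illustration; in the pipeline above n8n handles this step, so treat the Python as a sketch only.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumed Uptime Kuma heartbeat status codes: 0 = down, 1 = up.
STATUS_DOWN = 0


@dataclass
class Alert:
    monitor: str
    message: str


def parse_alert(payload: dict[str, Any]) -> Optional[Alert]:
    """Return an Alert for a DOWN heartbeat, or None for anything else."""
    heartbeat = payload.get("heartbeat") or {}
    monitor = payload.get("monitor") or {}
    if heartbeat.get("status") != STATUS_DOWN:
        return None  # recoveries and test notifications need no repair
    return Alert(
        monitor=monitor.get("name", "unknown"),
        message=heartbeat.get("msg") or payload.get("msg", ""),
    )
```

Filtering on the status early means "service is back up" notifications never trigger a repair attempt.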

2. n8n - The Orchestrator

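n8n receives the alert and does two things: it sends the error to Ollama for analysis, and it hands the alert to the Repair Agent so a fix can be attempted.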

3. Ollama - The Brain

Running locally on my AI server, Ollama (with Llama 3.2) analyzes the error and provides troubleshooting suggestions. These get sent to Home Assistant as notifications so I can see what's happening.
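
As a rough illustration of the analysis step, the sketch below builds a request for Ollama's `/api/generate` endpoint and reads back the generated text. The host name `ai-server`, the prompt wording, and the function names are assumptions made for this example. It uses only the standard library to stay dependency-free.

```python
import json
import urllib.request

# Assumption: Ollama listens on its default port on a host named "ai-server".
OLLAMA_URL = "http://ai-server:11434/api/generate"


def build_request(monitor: str, error: str, model: str = "llama3.2") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    prompt = (
        f"The service '{monitor}' is down. Error reported by the monitor:\n"
        f"{error}\n"
        "Suggest the most likely cause and a first troubleshooting step."
    )
    # stream=False asks for one JSON object instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(body: dict, url: str = OLLAMA_URL, timeout: float = 60.0) -> str:
    """POST the body to Ollama and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the request builder separate from the network call makes the prompt easy to test without a running model.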

4. Repair Agent - The Hands

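This is where the magic happens. I built a FastAPI service that receives the alert, asks Ollama to pick an action from a whitelist, runs that action over SSH on the right server, and logs every step.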

The Repair Agent

The agent has a strict whitelist of allowed actions:

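# Maps each action name to the only shell command it is allowed to run.
# {container_name} and {compose_dir} are filled in by the agent, never by free-form AI output.
ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
    "check_memory": "free -h",
    "check_docker_status": "docker ps",
    # Removes stopped containers, unused networks, dangling images, and build cache.
    "prune_docker_cache": "docker system prune -f",
    "restart_docker_compose": "cd {compose_dir} && docker compose restart",
}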
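
One way to turn a whitelist entry into a concrete command is sketched below. This is not the agent's actual code: `build_command` and the `SAFE_NAME` pattern are hypothetical, and only a subset of the templates is repeated to keep the example short. The point is that both the action name and any placeholder value are checked before anything reaches a shell.

```python
import re
import shlex

# Subset of the whitelist, repeated so the example is self-contained.
ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
}

# Assumption: container names use only letters, digits, '_', '.' and '-'.
SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$")


def build_command(action: str, **params: str) -> str:
    """Render a whitelisted template, refusing unknown actions or odd values."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action}")
    for key, value in params.items():
        if not SAFE_NAME.match(value):
            raise ValueError(f"unsafe value for {key}: {value!r}")
    # Quote as a second line of defense, even though values are already validated.
    quoted = {key: shlex.quote(value) for key, value in params.items()}
    return ALLOWED_ACTIONS[action].format(**quoted)
```

A path parameter such as `compose_dir` would need its own validation rule, since the name pattern here rejects slashes.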

It knows which container runs on which server and can SSH to any of my three machines (core, media, AI) to execute repairs.
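
The routing can be as simple as a lookup table. The sketch below is an assumption-heavy illustration: the container names and their placement on core, media, and AI are made up for the example, and it builds an argument list for the `ssh` command-line client instead of showing the asyncssh code the agent uses.

```python
# Hypothetical mapping of container name to host; the real inventory will differ.
CONTAINER_HOSTS = {
    "homeassistant": "core",
    "grafana": "core",
    "prometheus": "core",
    "jellyfin": "media",
    "ollama": "ai",
}


def host_for(container: str) -> str:
    """Look up which machine runs a container, failing loudly if unknown."""
    try:
        return CONTAINER_HOSTS[container]
    except KeyError:
        raise ValueError(f"no host known for container: {container}") from None


def ssh_command(container: str, remote_command: str) -> list[str]:
    """Build an argv list that runs remote_command on the container's host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host_for(container), remote_command]
```

Refusing unknown containers keeps the agent from guessing a host when the inventory is out of date.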

The key insight: in my experience, a restart fixes about 90% of issues. So I tuned the AI prompt to prefer action over diagnosis:

IMPORTANT GUIDELINES:
1. PREFER restart_container as first action - it fixes 90% of issues
2. Only use check_container_logs if error suggests config/code issue
3. The goal is to RESTORE SERVICE, not just diagnose
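
Whatever the prompt says, the model's reply still has to be checked before it is acted on. The sketch below assumes the model is asked to answer with a JSON object containing `action` and `reason` keys. That reply format, the `parse_decision` name, and the fall-back-to-restart behavior are all assumptions for illustration; a stricter agent might refuse to act on an unparseable reply.

```python
import json

ALLOWED = {
    "restart_container",
    "check_container_logs",
    "check_disk_space",
    "check_memory",
    "check_docker_status",
    "prune_docker_cache",
    "restart_docker_compose",
}

# Mirrors guideline 1 above: when in doubt, restart.
DEFAULT_ACTION = "restart_container"


def parse_decision(raw: str) -> dict:
    """Extract a whitelisted action from the model reply, else fall back."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED:
        return {
            "action": DEFAULT_ACTION,
            "reason": "fallback: reply did not name a whitelisted action",
        }
    return {"action": action, "reason": str(data.get("reason", ""))}
```

Because the result is always one of the whitelisted names, nothing the model writes can widen what the agent is able to run.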

Real-World Test

Here's an actual repair from my logs:

{
  "timestamp": "2026-01-01T03:02:25",
  "action": "decision",
  "details": {
    "monitor": "Home Assistant",
    "action": "restart_container",
    "reason": "Restarting container with low risk for most service failures"
  }
}

Home Assistant went down because the container had crashed after a failed integration update. Within 2 minutes:

  1. Uptime Kuma detected the outage
  2. n8n orchestrated the workflow
  3. Ollama decided to restart
  4. Repair Agent executed docker restart homeassistant
  5. Service restored automatically

I didn't have to lift a finger.

The Stack

Lessons Learned

  1. Start with guardrails: The whitelist approach prevents the AI from doing anything destructive. It can only pick from safe, pre-approved actions.
  2. Audit everything: Every decision and execution is logged. I can review what the agent did and why.
  3. Prefer action over analysis: Initially the AI was too conservative, always checking logs first. Tuning the prompt to prefer restarts made it actually useful.
  4. Test with real failures: I literally stopped containers to test the pipeline. Those deliberately induced failures caught issues that unit tests never would.
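
For the audit trail in lesson 2, a JSON-lines file is one simple option. The sketch below produces records in the same shape as the log entry shown earlier (`timestamp`, `action`, `details`). The function names and the one-line-per-record file format are assumptions for this example.

```python
import json
from datetime import datetime
from typing import Any, Optional


def audit_entry(
    action: str, details: dict[str, Any], now: Optional[datetime] = None
) -> str:
    """Serialize one audit record as a single JSON line."""
    # Drop microseconds so timestamps look like 2026-01-01T03:02:25.
    stamp = (now or datetime.now()).replace(microsecond=0).isoformat()
    return json.dumps({"timestamp": stamp, "action": action, "details": details})


def append_audit(path: str, line: str) -> None:
    """Append a record to the log file; existing entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Logging the decision and the execution as separate records makes it easy to see when the model chose an action that then failed to run.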

What's Next

Code

The full implementation is in my homelab repo. The repair agent is about 400 lines of Python - nothing fancy, just FastAPI + asyncssh + httpx for the Ollama calls.

Running a homelab shouldn't mean being on-call 24/7. With a bit of AI and automation, your infrastructure can take care of itself.