Mar 19, 2026 homelabaidevops

Building a Self-Healing Homelab with AI

How I built an autonomous repair system that monitors my homelab services and automatically fixes them when they go down.

The Problem

Running a homelab is great until something breaks at 2 AM. Whether it's a container that crashed, a service that ran out of memory, or just the usual Docker gremlins, I was tired of waking up to notifications about downed services and having to manually SSH in to restart things.

What if my homelab could fix itself?

The Solution: AI-Powered Self-Healing

I built a pipeline that combines monitoring, AI analysis, and automated repair actions:

Service Down → Uptime Kuma → n8n → Ollama → Repair Agent → Service Restored

Here's how each piece works:

1. Uptime Kuma - The Watchdog

Uptime Kuma monitors all my services - Grafana, Prometheus, Home Assistant, Jellyfin, and more. When something goes down, it fires a webhook to n8n.

2. n8n - The Orchestrator

n8n receives the alert and does two things:

Sends the error to Ollama for AI analysis
Forwards everything to my custom Repair Agent

3. Ollama - The Brain

Running locally on my AI server, Ollama (with Llama 3.2) analyzes the error and provides troubleshooting suggestions. These get sent to Home Assistant as notifications so I can see what's happening.

4. Repair Agent - The Hands

This is where the magic happens. I built a FastAPI service that:

Receives alerts with the AI analysis
Asks Ollama to pick from a whitelist of safe actions
Executes the repair via SSH
Logs everything for audit purposes

The Repair Agent

The agent has a strict whitelist of allowed actions:

ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
    "check_memory": "free -h",
    "check_docker_status": "docker ps",
    "prune_docker_cache": "docker system prune -f",
    "restart_docker_compose": "cd {compose_dir} && docker compose restart",
}

It knows which container runs on which server and can SSH to any of my three machines (core, media, AI) to execute repairs.

The key insight: restart fixes 90% of issues. So I tuned the AI prompt to prefer action over diagnosis:

IMPORTANT GUIDELINES:
1. PREFER restart_container as first action - it fixes 90% of issues
2. Only use check_container_logs if error suggests config/code issue
3. The goal is to RESTORE SERVICE, not just diagnose

Real-World Test

Here's an actual repair from my logs:

{
  "timestamp": "2026-01-01T03:02:25",
  "action": "decision",
  "details": {
    "monitor": "Home Assistant",
    "action": "restart_container",
    "reason": "Restarting container with low risk for most service failures"
  }
}

Home Assistant went down — the container had crashed due to a failed integration update. Within 2 minutes:

Uptime Kuma detected the outage
n8n orchestrated the workflow
Ollama decided to restart
Repair Agent executed docker restart homeassistant
Service restored automatically

I didn't have to lift a finger.

The Stack

Monitoring: Uptime Kuma
Orchestration: n8n (self-hosted)
AI: Ollama with Llama 3.2 (running on local GPU)
Repair Agent: Python/FastAPI with asyncssh
Notifications: Home Assistant persistent notifications
Infrastructure: Docker Compose across 3 Ubuntu servers

Lessons Learned

Start with guardrails: The whitelist approach prevents the AI from doing anything destructive. It can only pick from safe, pre-approved actions.
Audit everything: Every decision and execution is logged. I can review what the agent did and why.
Prefer action over analysis: Initially the AI was too conservative, always checking logs first. Tuning the prompt to prefer restarts made it actually useful.
Test with real failures: I literally stopped containers to test the pipeline. Simulated failures caught issues that unit tests never would.

What's Next

Add more repair actions (clear specific caches, rotate logs)
Implement escalation (try restart, then check logs, then notify human)
Track success rates to improve the AI's decision making
Maybe let it handle disk space issues automatically

Code

The full implementation is in my homelab repo. The repair agent is about 400 lines of Python - nothing fancy, just FastAPI + asyncssh + httpx for the Ollama calls.

Running a homelab shouldn't mean being on-call 24/7. With a bit of AI and automation, your infrastructure can take care of itself.