The Problem

Running a homelab is great until something breaks at 2 AM. Whether it was a crashed container, a service that ran out of memory, or just the usual Docker gremlins, I was tired of waking up to notifications about downed services and having to SSH in manually to restart things.

What if my homelab could fix itself?

The Solution: AI-Powered Self-Healing

I built a pipeline that combines monitoring, AI analysis, and automated repair actions:

Service Down → Uptime Kuma → n8n → Ollama → Repair Agent → Service Restored

Here's how each piece works:

1. Uptime Kuma - The Watchdog

Uptime Kuma monitors all my services - Grafana, Prometheus, Home Assistant, Jellyfin, and more. When something goes down, it fires a webhook to n8n.

2. n8n - The Orchestrator

n8n receives the alert and does two things:

  • Sends the error to Ollama for AI analysis
  • Forwards everything to my custom Repair Agent
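Uptime Kuma's generic webhook delivers a JSON payload with monitor and heartbeat objects. A minimal sketch of pulling out the fields this pipeline cares about — the field names here follow Uptime Kuma's payload shape but may vary by version, so treat them as assumptions:

```python
# Extract the fields the pipeline needs from an Uptime Kuma webhook
# payload. "monitor", "heartbeat", "msg", and "status" follow Uptime
# Kuma's generic webhook shape -- verify against your version.

def parse_alert(payload: dict) -> dict:
    monitor = payload.get("monitor", {})
    heartbeat = payload.get("heartbeat", {})
    return {
        "monitor": monitor.get("name", "unknown"),
        "url": monitor.get("url", ""),
        "error": heartbeat.get("msg", ""),
        "down": heartbeat.get("status") == 0,  # 0 = down, 1 = up
    }

alert = parse_alert({
    "monitor": {"name": "Portainer", "url": "http://core:9000"},
    "heartbeat": {"status": 0, "msg": "connect ECONNREFUSED"},
})
```

In n8n this extraction is a couple of nodes rather than code, but the shape of the data flowing through is the same.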

3. Ollama - The Brain

Running locally on my AI server, Ollama (with Llama 3.2) analyzes the error and provides troubleshooting suggestions. These get sent to Home Assistant as notifications so I can see what's happening.
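Talking to Ollama is one POST to its /api/generate endpoint. Here's a sketch of the request body and response handling — the endpoint, model field, and "stream" flag are standard Ollama API; the prompt wording is just illustrative:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_analysis_request(monitor: str, error: str, model: str = "llama3.2") -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": (
            f"Service '{monitor}' is down with error: {error}\n"
            "Suggest the most likely cause and a fix in two sentences."
        ),
        "stream": False,  # get one JSON reply instead of a token stream
    }

def extract_reply(response_body: str) -> str:
    """With stream=False, Ollama puts the generated text under 'response'."""
    return json.loads(response_body)["response"]

req = build_analysis_request("Portainer", "connect ECONNREFUSED 172.18.0.5:9000")
```

The actual HTTP call is a single `httpx.post(OLLAMA_URL, json=req)` away; the interesting part is keeping the prompt and parsing deterministic.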

4. Repair Agent - The Hands

This is where the magic happens. I built a FastAPI service that:

  • Receives alerts with the AI analysis
  • Asks Ollama to pick from a whitelist of safe actions
  • Executes the repair via SSH
  • Logs everything for audit purposes

The Repair Agent

The agent has a strict whitelist of allowed actions:

ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
    "check_memory": "free -h",
    "check_docker_status": "docker ps",
    "prune_docker_cache": "docker system prune -f",
    "restart_docker_compose": "cd {compose_dir} && docker compose restart",
}
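A whitelist is only as safe as the values substituted into it: `docker restart {container_name}` with a malicious name is still shell injection. Here's a hedged sketch of how enforcement can look (the whitelist is repeated so the snippet runs standalone; the helper names are mine, not the agent's):

```python
import re

ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
    "check_memory": "free -h",
    "check_docker_status": "docker ps",
    "prune_docker_cache": "docker system prune -f",
    "restart_docker_compose": "cd {compose_dir} && docker compose restart",
}

# Only allow safe characters in substituted values -- this blocks
# injection attempts like container_name = "x; rm -rf /".
SAFE_PARAM = re.compile(r"^[A-Za-z0-9._/-]+$")

def build_command(action: str, **params: str) -> str:
    """Reject anything off-whitelist, then format the command template."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not in whitelist: {action}")
    for value in params.values():
        if not SAFE_PARAM.match(value):
            raise ValueError(f"unsafe parameter value: {value!r}")
    return ALLOWED_ACTIONS[action].format(**params)
```

The AI never writes a shell command; it only names an action, and this layer turns that name into a vetted command string.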

It knows which container runs on which server and can SSH to any of my three machines (core, media, AI) to execute repairs.
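The container-to-host routing can be as simple as a lookup table. This is an illustrative sketch — the container names and inventory format are assumptions; the real mapping lives in the agent's config:

```python
# Hypothetical inventory: which of the three servers runs each container.
CONTAINER_HOSTS = {
    "portainer": "core",
    "grafana": "core",
    "prometheus": "core",
    "jellyfin": "media",
    "ollama": "ai",
}

def host_for(container: str) -> str:
    """Resolve which server to SSH into for a given container."""
    try:
        return CONTAINER_HOSTS[container]
    except KeyError:
        raise ValueError(f"unknown container: {container}")
```

Once the host is resolved, execution is an asyncssh connection to that machine running the whitelisted command.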

The key insight: restart fixes 90% of issues. So I tuned the AI prompt to prefer action over diagnosis:

IMPORTANT GUIDELINES:
1. PREFER restart_container as first action - it fixes 90% of issues
2. Only use check_container_logs if error suggests config/code issue
3. The goal is to RESTORE SERVICE, not just diagnose
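The prompt asks the model to answer with a small JSON object, but an LLM's output can't be trusted blindly, so the agent validates it and falls back to a safe default. A hedged sketch (the `{"action": ..., "reason": ...}` contract mirrors the audit log entries, but the exact prompt contract is an assumption):

```python
import json

# Same action names as the whitelist; anything else is rejected.
SAFE_ACTIONS = {
    "restart_container", "check_container_logs", "check_disk_space",
    "check_memory", "check_docker_status", "prune_docker_cache",
    "restart_docker_compose",
}

def parse_decision(reply: str, default: str = "restart_container") -> dict:
    """Parse the model's JSON reply; fall back to a restart on any garbage."""
    try:
        decision = json.loads(reply)
    except json.JSONDecodeError:
        return {"action": default, "reason": "unparseable model reply"}
    if decision.get("action") not in SAFE_ACTIONS:
        return {"action": default, "reason": "model picked an off-whitelist action"}
    return decision
```

This way a hallucinated or malformed answer degrades into the action that fixes most things anyway, instead of into a failure.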

Real-World Test

Here's an actual repair from my logs:

{
  "timestamp": "2026-01-01T03:02:25",
  "action": "decision",
  "details": {
    "monitor": "Portainer",
    "action": "restart_container",
    "reason": "Restarting container with low risk for most service failures"
  }
}

Portainer went down. Within 2 minutes:

  1. Uptime Kuma detected the outage
  2. n8n orchestrated the workflow
  3. Ollama decided to restart
  4. Repair Agent executed docker restart portainer
  5. Service restored automatically

I didn't have to lift a finger.

The Stack

  • Monitoring: Uptime Kuma
  • Orchestration: n8n (self-hosted)
  • AI: Ollama with Llama 3.2 (running on local GPU)
  • Repair Agent: Python/FastAPI with asyncssh
  • Notifications: Home Assistant persistent notifications
  • Infrastructure: Docker Compose across 3 Ubuntu servers

Lessons Learned

  1. Start with guardrails: The whitelist approach prevents the AI from doing anything destructive. It can only pick from safe, pre-approved actions.
  2. Audit everything: Every decision and execution is logged. I can review what the agent did and why.
  3. Prefer action over analysis: Initially the AI was too conservative, always checking logs first. Tuning the prompt to prefer restarts made it actually useful.
  4. Test with real failures: I literally stopped containers to test the pipeline. Simulated failures caught issues that unit tests never would.
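The audit trail that makes lesson 2 possible is just append-only JSON lines, one entry per decision or execution, in the same shape as the log excerpt above. A minimal sketch — the log path and function name are my placeholders:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("repair-agent-audit.jsonl")  # placeholder path

def audit(action: str, details: dict, log_path: Path = AUDIT_LOG) -> dict:
    """Append one JSON line per event, matching the log format shown above."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "action": action,
        "details": details,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

JSONL keeps the log greppable and trivially parseable, which matters more than structure when you're reviewing what an autonomous agent did overnight.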

What's Next

  • Add more repair actions (clear specific caches, rotate logs)
  • Implement escalation (try restart, then check logs, then notify human)
  • Track success rates to improve the AI's decision making
  • Maybe let it handle disk space issues automatically
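The escalation idea can be sketched as a simple ladder the agent climbs on repeated failures — this isn't built yet, so the rungs and the `notify_human` step are purely hypothetical:

```python
# Planned escalation ladder (not yet implemented): restart first,
# gather logs on the second failure, page a human after that.
ESCALATION = ["restart_container", "check_container_logs", "notify_human"]

def next_step(attempt: int) -> str:
    """Action for the Nth attempt (0-based), capped at the last rung."""
    return ESCALATION[min(attempt, len(ESCALATION) - 1)]
```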

Code

The full implementation is in my homelab repo. The repair agent is about 400 lines of Python - nothing fancy, just FastAPI + asyncssh + httpx for the Ollama calls.

Running a homelab shouldn't mean being on-call 24/7. With a bit of AI and automation, your infrastructure can take care of itself.