The Problem
Running a homelab is great until something breaks at 2 AM. Whether it's a container that crashed, a service that ran out of memory, or just the usual Docker gremlins, I was tired of waking up to notifications about downed services and having to manually SSH in to restart things.
What if my homelab could fix itself?
The Solution: AI-Powered Self-Healing
I built a pipeline that combines monitoring, AI analysis, and automated repair actions:
Service Down → Uptime Kuma → n8n → Ollama → Repair Agent → Service Restored
Here's how each piece works:
1. Uptime Kuma - The Watchdog
Uptime Kuma monitors all my services - Grafana, Prometheus, Home Assistant, Jellyfin, and more. When something goes down, it fires a webhook to n8n.
2. n8n - The Orchestrator
n8n receives the alert and does two things:
- Sends the error to Ollama for AI analysis
- Forwards everything to my custom Repair Agent
3. Ollama - The Brain
Running locally on my AI server, Ollama (with Llama 3.2) analyzes the error and provides troubleshooting suggestions. These get sent to Home Assistant as notifications so I can see what's happening.
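For illustration, here is a minimal sketch of what the analysis call to Ollama might look like. It is not the code from the repo: it uses the standard library's urllib instead of httpx so it runs without extra dependencies, and the URL (Ollama's default port), the model tag, the prompt wording, and the function names are all assumptions.

```python
import json
import urllib.request

# Assumptions: Ollama listening on its default port, Llama 3.2 pulled as "llama3.2".
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2"


def build_request(error_text: str, model: str = MODEL) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    prompt = (
        "A homelab service is down. Suggest likely causes and fixes.\n"
        f"Error: {error_text}"
    )
    # stream=False asks for a single JSON object instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def analyze(error_text: str, url: str = OLLAMA_URL, timeout: float = 60.0) -> str:
    """Send the error to Ollama and return the generated text."""
    body = json.dumps(build_request(error_text)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # With streaming disabled, the generated text is in the "response" field.
        return json.loads(resp.read())["response"]
```

Calling analyze("connection refused on port 9000") would return the model's free-text suggestions, which can then be forwarded as a notification.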
4. Repair Agent - The Hands
This is where the magic happens. I built a FastAPI service that:
- Receives alerts with the AI analysis
- Asks Ollama to pick from a whitelist of safe actions
- Executes the repair via SSH
- Logs everything for audit purposes
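The four steps above can be sketched as one function. This is a simplified illustration, not the actual agent: the FastAPI route and the asyncssh call are replaced by plain callables passed in as arguments so the flow can be read and tested on its own, and every name here (Alert, handle_alert, the callables) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass
class Alert:
    monitor: str        # monitor name from the alert, e.g. "Portainer"
    error: str          # error text reported by the monitor
    analysis: str = ""  # AI analysis attached earlier in the pipeline


def handle_alert(
    alert: Alert,
    choose_action: Callable[[Alert], str],     # stand-in for the Ollama call
    run_action: Callable[[str, Alert], str],   # stand-in for the SSH execution
    log: Callable[[str, dict], None],          # stand-in for the audit logger
    allowed: FrozenSet[str],
) -> Optional[str]:
    """Decide, validate against the whitelist, execute, and log each step."""
    action = choose_action(alert)
    log("decision", {"monitor": alert.monitor, "action": action})
    if action not in allowed:
        # Anything outside the whitelist is logged and dropped, never executed.
        log("rejected", {"monitor": alert.monitor, "action": action})
        return None
    output = run_action(action, alert)
    log("execution", {"monitor": alert.monitor, "action": action, "output": output})
    return output
```

Keeping the whitelist check between the decision and the execution means the model's answer is treated as a suggestion and only validated actions reach a server.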
The Repair Agent
The agent has a strict whitelist of allowed actions:
ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
    "check_memory": "free -h",
    "check_docker_status": "docker ps",
    "prune_docker_cache": "docker system prune -f",
    "restart_docker_compose": "cd {compose_dir} && docker compose restart",
}
It knows which container runs on which server and can SSH to any of my three machines (core, media, AI) to execute repairs.
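As a rough sketch of how a whitelisted action could be turned into a concrete command for the right machine: the container-to-host map below is made up for illustration (only the three machine names come from this post), and render_command is a hypothetical helper. Quoting the container name with shlex.quote is an extra precaution added in this sketch.

```python
import shlex
from typing import Tuple

# A subset of the whitelist, repeated here so the sketch is self-contained.
ALLOWED_ACTIONS = {
    "restart_container": "docker restart {container_name}",
    "check_container_logs": "docker logs --tail 50 {container_name}",
    "check_disk_space": "df -h",
}

# Hypothetical mapping; which container lives on which host is an assumption.
CONTAINER_HOSTS = {
    "portainer": "core",
    "jellyfin": "media",
    "ollama": "ai",
}


def render_command(action: str, container_name: str) -> Tuple[str, str]:
    """Return (host, command) for a whitelisted action on a known container."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action}")
    if container_name not in CONTAINER_HOSTS:
        raise ValueError(f"unknown container: {container_name}")
    # Quote the argument so a stray space or ';' cannot change the command.
    command = ALLOWED_ACTIONS[action].format(
        container_name=shlex.quote(container_name)
    )
    return CONTAINER_HOSTS[container_name], command


print(render_command("restart_container", "portainer"))
# ('core', 'docker restart portainer')
```

Templates without a placeholder, such as "df -h", pass through format() unchanged, so the same helper covers both kinds of action.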
The key insight: in my experience, a restart fixes roughly 90% of issues. So I tuned the AI prompt to prefer action over diagnosis:
IMPORTANT GUIDELINES:
1. PREFER restart_container as first action - it fixes 90% of issues
2. Only use check_container_logs if error suggests config/code issue
3. The goal is to RESTORE SERVICE, not just diagnose
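One way guidelines like these could be wired into a prompt, with the model's reply checked against the whitelist before anything runs. This is a sketch under assumptions: the surrounding prompt wording, build_prompt, and parse_action are hypothetical, and returning None for an unrecognized reply (rather than defaulting to a restart) is a choice made for this example.

```python
import re
from typing import Optional

ACTIONS = [
    "restart_container",
    "check_container_logs",
    "check_disk_space",
    "check_memory",
    "check_docker_status",
    "prune_docker_cache",
    "restart_docker_compose",
]

GUIDELINES = """IMPORTANT GUIDELINES:
1. PREFER restart_container as first action - it fixes 90% of issues
2. Only use check_container_logs if error suggests config/code issue
3. The goal is to RESTORE SERVICE, not just diagnose"""


def build_prompt(monitor: str, error: str) -> str:
    """Combine the alert, the allowed actions, and the guidelines."""
    return (
        f"Service '{monitor}' is down. Error: {error}\n"
        f"Choose exactly one action from: {', '.join(ACTIONS)}\n"
        f"{GUIDELINES}\n"
        "Reply with the action name only."
    )


def parse_action(reply: str) -> Optional[str]:
    """Return the first whitelisted action named in the reply, else None."""
    for token in re.findall(r"[a-z_]+", reply.lower()):
        if token in ACTIONS:
            return token
    return None  # nothing recognizable: leave it for a human
```

Because parse_action only ever returns a name from ACTIONS, a chatty or malformed reply cannot smuggle in a command of its own.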
Real-World Test
Here's an actual repair from my logs:
{
  "timestamp": "2026-01-01T03:02:25",
  "action": "decision",
  "details": {
    "monitor": "Portainer",
    "action": "restart_container",
    "reason": "Restarting container with low risk for most service failures"
  }
}
Portainer went down. Within 2 minutes:
- Uptime Kuma detected the outage
- n8n orchestrated the workflow
- Ollama decided to restart
- Repair Agent executed docker restart portainer
- Service restored automatically
I didn't have to lift a finger.
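A log entry shaped like the one above could be produced with a few lines of standard-library Python. This is only a sketch: writing one JSON object per line to an append-only file is an assumption about the log format, and audit_record and append_audit are hypothetical names.

```python
import json
from datetime import datetime
from typing import Optional


def audit_record(action: str, details: dict, now: Optional[datetime] = None) -> str:
    """Serialize one audit entry as a single JSON line."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    return json.dumps({"timestamp": stamp, "action": action, "details": details})


def append_audit(path: str, line: str) -> None:
    """Append the entry to the log file, one entry per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


line = audit_record(
    "decision",
    {"monitor": "Portainer", "action": "restart_container"},
    now=datetime(2026, 1, 1, 3, 2, 25),
)
print(line)
```

Passing the timestamp in explicitly, as in the usage lines, keeps the function easy to test; in normal use the default of the current time applies.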
The Stack
- Monitoring: Uptime Kuma
- Orchestration: n8n (self-hosted)
- AI: Ollama with Llama 3.2 (running on local GPU)
- Repair Agent: Python/FastAPI with asyncssh
- Notifications: Home Assistant persistent notifications
- Infrastructure: Docker Compose across 3 Ubuntu servers
Lessons Learned
- Start with guardrails: The whitelist approach prevents the AI from running arbitrary commands. It can only pick from pre-approved actions.
- Audit everything: Every decision and execution is logged. I can review what the agent did and why.
- Prefer action over analysis: Initially the AI was too conservative, always checking logs first. Tuning the prompt to prefer restarts made it actually useful.
- Test with real failures: I literally stopped containers to test the pipeline. Those deliberately induced failures caught issues that unit tests never would.
What's Next
- Add more repair actions (clear specific caches, rotate logs)
- Implement escalation (try restart, then check logs, then notify human)
- Track success rates to improve the AI's decision making
- Maybe let it handle disk space issues automatically
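The escalation idea from the list above (try a restart, then check logs, then notify a human) might look something like this. Nothing here exists in the agent yet: the step names, the notify_human step, and the run_step and is_healthy callables are all hypothetical placeholders.

```python
from typing import Callable, List, Sequence

# Hypothetical ladder; "notify_human" is not one of the whitelisted actions.
ESCALATION_STEPS = ("restart_container", "check_container_logs", "notify_human")


def escalate(
    run_step: Callable[[str], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[str] = ESCALATION_STEPS,
) -> List[str]:
    """Run steps in order until the service is healthy; return the steps tried."""
    tried: List[str] = []
    for step in steps:
        run_step(step)
        tried.append(step)
        if is_healthy():
            break  # service is back, no need to climb further
    return tried
```

If the restart brings the service back, the ladder stops after the first step; if nothing helps, every step runs and the last one hands the problem to a person.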
Code
The full implementation is in my homelab repo. The repair agent is about 400 lines of Python - nothing fancy, just FastAPI + asyncssh + httpx for the Ollama calls.
Running a homelab shouldn't mean being on-call 24/7. With a bit of AI and automation, your infrastructure can take care of itself.