Why Run AI Locally?
When people hear "self-hosted AI," they often think it's about replacing ChatGPT or Claude. It's not — at least not for me. Hosted models are better at complex reasoning, long-form writing, and tasks that need massive context windows. But there's a whole category of AI work where local models aren't just "good enough" — they're actually better suited.
I run Ollama on my homelab with an RTX 5060 Ti (16GB VRAM), and it handles a surprising amount of my day-to-day AI workloads. Here's what I've learned about when to run locally vs when to reach for a hosted API.
My Local Setup
Ollama runs as a native systemd service on my AI server, exposed on port 11434. It currently runs four models:
- Llama 3.2 (3B) — The workhorse. Handles 90% of my automated tasks.
- Qwen3 8B — My default chat model for interactive use via the terminal.
- CodeLlama — Available for code-specific tasks, though I use it less than I expected.
- nomic-embed-text — Powers the embedding pipeline for my memory system's vector search.
Everything connects through three paths: direct LAN access for services on the same network, Docker bridge for containerized services, and Traefik HTTPS for external access through my reverse proxy.
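Talking to Ollama over those paths is just HTTP. As a minimal sketch using only Python's standard library, here's how a service might build a request against Ollama's documented `/api/generate` endpoint on port 11434 (the host and model name are illustrative; actually sending the request needs a running Ollama instance, so that part is shown commented out):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint.

    stream=False asks for one complete JSON response instead of a
    stream of partial chunks, which is simpler for automation jobs.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

# Sending it (requires a live Ollama server, so not run here):
# with urllib.request.urlopen(build_generate_request("llama3.2", "Say hi")) as r:
#     print(json.loads(r.read())["response"])
```

The same request shape works whether the caller sits on the LAN, inside a Docker bridge network, or behind Traefik; only the base URL changes.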
What Works Great Locally
Agent Tasks with Simple Tool Calls
My repair agent uses Llama 3.2 to decide how to fix crashed services. It receives an alert, picks from a whitelist of safe actions (restart container, check logs, prune Docker cache), and executes the fix via SSH. This is a perfect local model task because the decision space is constrained — the model just needs to pick the right action from a short list based on the error context.
The response time is fast (a few seconds), there's no API cost per call, and it runs 24/7 without worrying about rate limits. When your repair agent fires at 3 AM because Portainer crashed, you don't want it waiting on a rate-limited API.
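The constrained decision space is also what makes this safe. A sketch of the pattern (action names are hypothetical, not my actual whitelist): normalize whatever text the model returns and refuse anything outside the whitelist, so a hallucinated command can never reach the SSH executor.

```python
# Hypothetical whitelist of safe repair actions the model may choose from.
SAFE_ACTIONS = {"restart_container", "check_logs", "prune_docker_cache"}

def pick_action(model_output: str, default: str = "check_logs") -> str:
    """Normalize the model's free-text reply and constrain it to the whitelist.

    Anything outside the whitelist falls back to a harmless default,
    so the worst case is reading logs, never an unintended command.
    """
    candidate = model_output.strip().lower().replace(" ", "_").rstrip(".")
    return candidate if candidate in SAFE_ACTIONS else default
```

The model only has to produce something close to a whitelist entry; the validation layer does the rest.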
Workflow Automation
My n8n workflows call Ollama constantly. I have workflows for AI-powered incident analysis, Docker health monitoring, email classification, and daily health report generation. These are all small, focused tasks — summarize this error, classify this email, generate a brief report. A 3B model handles them perfectly.
The key insight: these tasks have narrow scope and clear structure. You're not asking the model to reason about abstract problems — you're asking it to process specific data in a predictable way.
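Email classification is a good example of giving a small model narrow scope and clear structure: one question, a fixed answer set, and a parser that tolerates sloppy output. A sketch (the label set is hypothetical):

```python
# Hypothetical label set for the email-classification workflow.
LABELS = ("urgent", "newsletter", "receipt", "other")

def classify_prompt(subject: str, body: str) -> str:
    """Build a narrow, structured prompt: one question, fixed answers."""
    return (
        f"Classify this email as one of: {', '.join(LABELS)}.\n"
        f"Subject: {subject}\n"
        f"Body: {body}\n"
        "Answer with a single label only."
    )

def parse_label(model_output: str) -> str:
    """Accept only a known label; anything unexpected becomes 'other'."""
    answer = model_output.strip().lower()
    return answer if answer in LABELS else "other"
```

Because the parser maps anything unexpected to `other`, a 3B model's occasional rambling answer degrades gracefully instead of breaking the workflow.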
Embeddings
My entire memory system runs on nomic-embed-text for vector embeddings. Every memory stored, every semantic search query — all processed locally. Embeddings are the easiest win for local AI because the models are tiny and fast, and the quality difference between local and hosted embedding models is negligible for most use cases.
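Once you have embedding vectors, semantic search is just cosine similarity plus a sort. A self-contained sketch of the retrieval step (a real setup would use a vector store; this shows the math):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float],
          stored: dict[str, list[float]],
          k: int = 3) -> list[str]:
    """Rank stored memories by similarity to the query embedding."""
    ranked = sorted(
        stored,
        key=lambda key: cosine_similarity(query_vec, stored[key]),
        reverse=True,
    )
    return ranked[:k]
```

In practice the vectors come from nomic-embed-text via Ollama's embeddings endpoint and live in a vector database; the ranking logic is the same.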
Voice Assistant
The Wyoming voice pipeline uses Ollama as an optional conversation agent for my AL-1S voice assistant. For simple home automation commands ("turn off the office lights," "what's the temperature"), a local model responds faster than a round trip to a cloud API.
Where Hosted Models Still Win
Complex Agentic Work
When I need Claude Code to refactor a module, design an architecture, or debug a subtle issue across multiple files, local models can't compete. Tasks that require long chains of reasoning, maintaining context across many steps, or understanding nuanced intent — these need the bigger models.
The difference is stark: Llama 3.2 can pick "restart container" from a list. It cannot reliably plan a multi-step infrastructure migration.
Tool Use at Scale
My local models handle simple tool calls well — one or two tools with clear schemas. But complex tool use with many available tools, where the model needs to figure out which combination to chain together? That's where hosted models with better instruction following shine.
I've tried running agentic loops with Llama 3.2 where it has access to multiple MCP tools. It works sometimes, but the reliability drops compared to hosted models. The failure mode is usually the model calling the wrong tool or passing malformed arguments.
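You can catch both failure modes with a validation gate between the model and the tool executor. A sketch, with hypothetical tool schemas (a real setup would validate types too, e.g. with JSON Schema):

```python
# Hypothetical tool schemas: tool name -> required argument names.
TOOL_SCHEMAS = {
    "restart_service": {"service"},
    "read_log": {"service", "lines"},
}

def validate_tool_call(call: dict) -> bool:
    """Reject calls to unknown tools or with missing or extra arguments,
    the two failure modes small local models hit most often."""
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOL_SCHEMAS:
        return False  # model invented a tool
    return set(args) == TOOL_SCHEMAS[name]  # exact argument match
```

A rejected call can be retried with the validation error fed back into the prompt, which recovers some of the gap to hosted models at the cost of extra round trips.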
Content Creation
Writing blog posts, documentation, or anything that needs to sound natural and well-structured? Hosted models, every time. Local models can summarize and classify, but they struggle with the kind of coherent long-form output that reads well.
The Cost Equation
Here's the real math: my Ollama setup processes thousands of requests per day across all my automation workflows. If each of those hit a hosted API, the costs would add up fast — even at pennies per request.
But the GPU costs money too. The 5060 Ti was around $400, and it draws power 24/7. The breakeven point depends on your volume, but for my use case (high volume, simple tasks, always-on), local wins easily.
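The breakeven calculation itself is simple. A sketch with the $400 GPU price from above; the power draw, electricity price, and per-request API cost are illustrative assumptions, not measured values:

```python
def breakeven_days(gpu_cost: float, watts: float, kwh_price: float,
                   requests_per_day: int, hosted_cost_per_request: float) -> float:
    """Days until a local GPU pays for itself versus a hosted API.

    All inputs except gpu_cost are assumed figures for illustration.
    """
    daily_power_cost = watts / 1000 * 24 * kwh_price
    daily_hosted_cost = requests_per_day * hosted_cost_per_request
    daily_savings = daily_hosted_cost - daily_power_cost
    if daily_savings <= 0:
        return float("inf")  # hosted stays cheaper at this volume
    return gpu_cost / daily_savings

# Example: $400 GPU, ~100 W average draw, $0.15/kWh,
# 2,000 requests/day at $0.002 per request (assumed figures)
# -> roughly 110 days to break even.
```

The interesting part is the volume sensitivity: at a few dozen requests per day the GPU never pays for itself, while at thousands per day it does within months.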
The hybrid approach makes the most sense: local for high-volume automated tasks, hosted for complex reasoning when you need it.
Plans: Fine-Tuning for the Homelab
Here's what I'm thinking about next. A general-purpose 3B model works fine for most tasks, but it doesn't know anything about MY specific infrastructure. When the repair agent gets an alert about "chromadb," Llama 3.2 gives generic Docker troubleshooting advice. A fine-tuned model could know that ChromaDB on my setup runs on port 8100, stores data in a specific volume, and has a known issue with memory usage after 72 hours.
The plan is to fine-tune an open-weights model on my homelab-specific data:
- Historical repair logs (what actions fixed what problems)
- Service configurations and relationships
- Common failure patterns and their actual resolutions
- My infrastructure topology (which services run where)
The goal isn't to replace the general model — it's to create a specialist that's better at the specific tasks my homelab needs. A model that knows "when Jellyfin goes down on the media server, check the transcode cache first" because that's what the repair history shows.
This is where open-weights models shine over hosted APIs. You can't fine-tune Claude or GPT on your infrastructure data and run it locally. With Ollama and something like Llama or Qwen, you can.
I'm still figuring out the right approach — whether to use LoRA adapters, full fine-tuning on a small model, or just better prompt engineering with RAG. But the direction is clear: the more your AI knows about your specific environment, the more useful it becomes.
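The RAG-flavored end of that spectrum is easy to prototype today: inject environment-specific facts into the prompt so a generic model answers like a specialist, no fine-tuning required. A sketch using facts mentioned earlier in this post; the fact store here is a hard-coded dict standing in for the memory system's vector search:

```python
# Stand-in for the memory system; facts are the ones described above.
SERVICE_FACTS = {
    "chromadb": "Runs on port 8100; known memory-usage issue after 72 hours.",
    "jellyfin": "Runs on the media server; check the transcode cache first.",
}

def augment_prompt(alert: str, service: str) -> str:
    """Prepend environment-specific facts so a general-purpose model
    gives setup-specific advice instead of generic troubleshooting."""
    facts = SERVICE_FACTS.get(service, "No local facts recorded.")
    return (
        f"Known facts about '{service}' in this homelab: {facts}\n"
        f"Alert: {alert}\n"
        "Suggest the most likely fix."
    )
```

Whether this beats a LoRA-tuned model for my failure patterns is exactly the open question; the nice thing is it costs an afternoon to test.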
Getting Started
If you want to add local AI to your homelab:
- Start with Ollama — it's the easiest way to run models locally. One install, one command to pull a model.
- Pick the right model for the job — don't default to the biggest model. A 3B model is often perfect for automation tasks.
- Use hosted models for what they're good at — complex reasoning, content creation, multi-step planning.
- Measure before optimizing — track what tasks your AI handles and where it fails. That tells you where to invest in better models or fine-tuning.
- Keep it hybrid — the best setup uses both local and hosted models, each for what they do best.
The point isn't "local vs hosted" — it's knowing which tool fits which job.