The Motivation

In my last article I talked about the hybrid approach I use for AI in my homelab: local models for high-volume automated tasks, hosted models for complex reasoning. Toward the end I mentioned I was thinking about fine-tuning an open-weights model on my homelab-specific data. This is that project.

The problem with a general-purpose 3B model is that it doesn't know anything about my infrastructure. When the repair agent gets an alert about ChromaDB, Llama 3.2 gives me generic Docker troubleshooting advice. It doesn't know that ChromaDB on my setup runs on port 8100, stores data in a specific volume, and has a known memory usage pattern after extended uptime. It doesn't know which containers run on which hosts, what ports things use, what my repair history looks like, or how my services are wired together.

The goal is a specialist: a model that knows my specific setup inside and out, so that infra-specific tasks can be routed away from expensive general-purpose Claude calls. The plan is to fine-tune Qwen2.5-Coder on my actual homelab data: Ansible playbooks, task logs, bash scripts, skill definitions, and real repair history. The fine-tuned model runs locally on homelab-ai and handles the high-volume, repetitive infrastructure queries without touching the hosted API.

The Hardware

The training runs on homelab-ai (192.168.0.125, Ubuntu 24.04), which has an NVIDIA RTX 5060 Ti with 16GB VRAM. This is consumer-grade hardware — nothing exotic, just a gaming card I put in my AI server. The whole premise of this project is that you don't need a datacenter to get useful fine-tuned models for your own data.

VRAM budget matters more than raw GPU specs. With Ollama stopped and the GPU fully available, I have about 14.7GB free. That fits a 14B model with QLoRA (4-bit quantization) at around 13–14GB peak. That's the practical ceiling for this card — anything larger and I'd need to reduce the batch size to the point where training becomes impractically slow.
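The fit is easy to sanity-check with back-of-envelope arithmetic (my own estimate, not output from any tool): nf4 stores each weight in 4 bits, so the frozen base weights of an N-billion-parameter model take roughly N/2 GB, and everything else piles on top of that.

```python
def nf4_weight_gb(params_billion: float, bits: int = 4) -> float:
    """Rough size of the frozen base weights alone when quantized to nf4.

    Real peak VRAM during QLoRA training is several GB higher: activations,
    LoRA adapter weights, optimizer state, and the CUDA context all add up.
    """
    return params_billion * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

# A 14B model: ~7 GB of quantized weights, leaving ~7 GB of headroom on a
# 16GB card for everything else -- consistent with the ~13-14GB peak.
print(nf4_weight_gb(14))  # 7.0
print(nf4_weight_gb(7))   # 3.5
```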

For inference after training, the 7B size is actually a better fit — it runs at 87 tok/s on this hardware and uses only 4.8GB VRAM, leaving room for the rest of the homelab services. More on the model size decision in Part 3 when I get to the actual training run.

Phase 1: The Training Environment

Setting up the training stack took a couple of sessions of troubleshooting before I had a configuration that actually worked. Here's what landed.

CUDA and Drivers

CUDA 12.6, with nvcc working. The RTX 5060 Ti is SM 12.0 (Blackwell generation), which becomes relevant in a moment.

Python Stack

Python 3.12, with these specific versions that actually work together:

  • unsloth 2026.4.2 — the training framework. Makes QLoRA fast on consumer hardware by fusing kernels and reducing memory overhead.
  • torch 2.10.0+cu128 — PyTorch with CUDA 12.8 support.
  • bitsandbytes 0.49.2 — for 4-bit quantization (nf4 format). This is what makes 14B models fit in 16GB VRAM.
  • trl 0.24.0 — Hugging Face's trainer library. Handles the training loop, evaluation, and checkpointing.
  • peft 0.18.1 — the LoRA adapter library. Training only the adapter weights instead of all 7B parameters is what makes this feasible on a single consumer GPU.

The version pinning matters. The ecosystem moves fast and incompatible combinations are common: I hit the evaluation_strategy → eval_strategy rename in transformers, which broke things silently until I checked the logs.
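Given how fragile the combination is, a quick pre-flight check is worth having. A minimal sketch that compares installed versions against the pins above (the helper is mine, not part of any of these libraries):

```python
from importlib.metadata import PackageNotFoundError, version

# The combination that worked together on this machine.
PINS = {
    "unsloth": "2026.4.2",
    "torch": "2.10.0+cu128",
    "bitsandbytes": "0.49.2",
    "trl": "0.24.0",
    "peft": "0.18.1",
}

def check_pins(pins):
    """Return (package, expected, installed) for every mismatch.

    installed is None when the package isn't installed at all.
    """
    mismatches = []
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches.append((pkg, expected, installed))
    return mismatches

if __name__ == "__main__":
    for pkg, want, got in check_pins(PINS):
        print(f"{pkg}: expected {want}, found {got}")
```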

The Flash-Attn Problem

This is the one that ate most of my troubleshooting time. The RTX 5060 Ti is SM 12.0 (Blackwell), and flash-attn doesn't have a precompiled kernel for that architecture yet. You can compile from source, but it takes 30+ minutes and still has edge cases.

The fix: disable flash-attn and fall back to PyTorch's native Scaled Dot-Product Attention (SDPA). Set attn_implementation="sdpa" when loading the model and set the environment variable that tells the stack not to use flash-attn. You lose maybe 5% throughput, but everything works reliably. If you're on a Blackwell-generation GPU or newer, you'll likely hit the same issue.
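That decision can be keyed off the GPU's compute capability, which torch.cuda.get_device_capability() reports as a (major, minor) tuple. A minimal sketch; the threshold just encodes "no prebuilt flash-attn wheels for SM 12.0 yet", so revisit it once wheels ship:

```python
def pick_attn_implementation(capability):
    """Choose the value to pass as attn_implementation=... when loading.

    flash-attn has no precompiled kernels for SM 12.0 (Blackwell) at the
    time of writing, so fall back to PyTorch's SDPA there. On older
    architectures with prebuilt wheels, flash-attn is the faster choice.
    """
    if capability >= (12, 0):
        return "sdpa"
    return "flash_attention_2"

# In a real training script this would feed the model load, e.g.:
#   capability = torch.cuda.get_device_capability()
#   model = AutoModelForCausalLM.from_pretrained(
#       name, attn_implementation=pick_attn_implementation(capability))
```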

Running Alongside Ollama

I already use Ollama constantly for agent work and can't just reserve the whole GPU for training. The solution is crude but effective: train-start.sh stops the Ollama systemd service before training begins, freeing the VRAM it holds, and train-stop.sh restarts it when training finishes. The dispatcher that routes tasks to Ollama knows to pause while training is running.

This means training is disruptive — any local AI tasks queue up while it's running. For a 30-minute training run that's acceptable. For a multi-hour run I'd need something smarter, but for now it works.
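The same stop/restart dance can be sketched as a Python context manager instead of the two shell scripts (the unit name "ollama" and the use of sudo are assumptions about my systemd setup):

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def gpu_reserved(run=subprocess.run):
    """Stop the Ollama systemd unit to free its VRAM; restart it afterwards.

    The finally block means Ollama comes back even if training crashes,
    which a separate train-stop.sh doesn't guarantee on its own.
    """
    run(["sudo", "systemctl", "stop", "ollama"], check=True)
    try:
        yield
    finally:
        run(["sudo", "systemctl", "start", "ollama"], check=True)

# Usage:
#   with gpu_reserved():
#       trainer.train()
```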

Directory Structure

Everything lives under ~/homelab-llm/ on homelab-ai:

  • scripts/ — extraction and generation scripts
  • dataset/ — raw JSONL files from extraction, synthetic generation output, and the final formatted training set
  • training/ — training config, the train script, and the run wrapper

Keeping everything under one directory means it's easy to tear down and rebuild, and easy to inspect the intermediate artifacts when something goes wrong.

What's Coming Next

This is the boring part — just making sure the environment works before touching any data. The interesting work starts in Part 2, where I extract training data from my actual homelab repo. I've got Python scripts that read Ansible playbooks, skill YAML files, and bash scripts and convert them into training examples in Qwen's ChatML tool-call format. The goal is 130+ examples derived directly from the authoritative ground truth of the repo itself.
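To make the target concrete, here is one hypothetical training example in the messages-style JSONL that chat fine-tuning stacks consume (the question and answer are illustrative, not real extraction output; the port and host come from earlier in this article):

```python
import json

# One invented training example in messages format.
example = {
    "messages": [
        {"role": "system",
         "content": "You are the homelab infrastructure assistant."},
        {"role": "user",
         "content": "What port does ChromaDB listen on, and on which host?"},
        {"role": "assistant",
         "content": "ChromaDB runs on homelab-ai (192.168.0.125), port 8100."},
    ]
}

# Each example becomes one line of the dataset JSONL; the tokenizer's chat
# template later renders it into ChatML (<|im_start|>role ... <|im_end|>).
line = json.dumps(example, ensure_ascii=False)
assert json.loads(line) == example
```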

Part 3 covers the actual training run — how long it took, what the loss curve looked like, and the first test of whether the fine-tuned model actually knows anything useful about my homelab. There's an interesting failure mode in the first run I'll dig into there.

Why Bother?

Claude (via my Max plan) is excellent for complex reasoning and multi-step agent work. But for the high-volume, repetitive stuff — "which container runs on which host," "what port does Grafana use," "run the health check on homelab-ai" — I don't need a general-purpose frontier model. I need something that knows my setup inside and out and runs in 4GB of VRAM.

The hybrid approach from the last article applies here too: use local models for high-volume simple tasks, hosted models for complex reasoning. Fine-tuning just makes the local half actually useful for my specific environment. A general model gives generic advice; a fine-tuned model that's seen my Ansible playbooks, my repair logs, and my service topology gives advice that's actually actionable for my setup.

And this is exactly what open-weights models make possible that hosted APIs don't. You can't fine-tune Claude or GPT-4 on your private infrastructure data and run the result locally. With Qwen and Ollama, you can.