Recap: Where Part 1 Left Off
Part 1 covered setting up the training environment on the RTX 5060 Ti. Now comes the interesting part: building a dataset from my actual homelab, training the model, and iterating on quality automatically.
The Dataset: Mining My Own Infrastructure
I wrote Python scripts that read my actual homelab repo and convert everything into training examples. Four extraction scripts pull from Ansible playbooks (12 playbooks, 21 roles), AL-1S worker skill YAML files (10 skills), bash scripts (30+ files), and architecture documentation. The output is OpenAI function-calling format JSONL — the same format the model will use in production.
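As a sketch, one extracted Q&A pair becomes one JSONL record shaped like an OpenAI chat-format message list. The helper below is illustrative, not the actual extraction script, and the field names follow the standard OpenAI format rather than anything homelab-specific:

```python
import json

def to_training_example(question, answer, tools=None):
    """Convert one mined Q&A pair into a single JSONL line in
    OpenAI chat/function-calling format. The real extractors pull
    the pairs out of Ansible playbooks, skill YAML, and docs."""
    record = {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }
    if tools:  # attach tool schemas only for tool-call examples
        record["tools"] = tools
    return json.dumps(record)

# One line per example, appended to the dataset file.
line = to_training_example(
    "What IP does homelab-ai run on?",
    "192.168.0.125",
)
```

Keeping the dataset in the same format the model sees in production means no translation layer between training and inference.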
After the min-tokens fix (more on that below), the final dataset came in at 1,105 examples, broken down as: 180 architecture, 120 operational, 111 tool_calls, and the remainder spread across other categories. The distribution is still skewed, but far better than it started.
The First Training Run
Training Qwen2.5-Coder-7B-Instruct with QLoRA took about 30 minutes per iteration. Training loss couldn't be parsed from the output — it was showing as "not captured" in the harness logs, a logging issue I still need to chase down. Speed was solid at 87 tokens/second.
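For reference, a typical QLoRA setup for a 7B model on a single consumer GPU looks something like the following. These are common community defaults (4-bit NF4 base, LoRA adapters on the attention and MLP projections), not my exact hyperparameters:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization so the 7B base fits in consumer VRAM.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
# LoRA adapters on the usual Qwen2 projection layers.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```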
The early evaluation results were brutal: 2.8% overall. Architecture: 5%. Operational: 5%. Tool-calls: 0%. The model had learned essentially nothing useful about my homelab. When I asked "What happened to homelab-core?", it invented task names and wrong IPs instead of knowing that homelab-core died and its services migrated to homelab-ai.

Root cause was obvious in hindsight: almost no architecture or operational examples were making it into training. At 7B parameters, factual knowledge needs repetition to stick — but first the examples have to actually reach the model.
Building the Training Harness
Rather than manually iterate, I built an automated training harness that runs the full cycle: extract data → analyze dataset balance → generate synthetic examples for weak categories → format → train → evaluate → decide if another iteration is needed.
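The control flow of that cycle can be sketched as a plain loop with an early exit on a passing score. The step names and callable-dict structure here are hypothetical; the real harness shells out to the individual scripts:

```python
PASS_THRESHOLD = 0.70   # the 70% bar discussed below
MAX_ITERATIONS = 5

def run_harness(steps):
    """steps: dict of callables, one per pipeline stage.
    Returns (iterations_run, final_score)."""
    for i in range(1, MAX_ITERATIONS + 1):
        dataset = steps["extract"]()
        weak = steps["analyze"](dataset)       # under-represented categories
        dataset += steps["generate"](weak)     # synthetic examples for them
        steps["train"](steps["format"](dataset))
        score = steps["evaluate"]()
        if score >= PASS_THRESHOLD:
            return i, score                    # passed, stop early
    return MAX_ITERATIONS, score               # cap reached, write report
```

The early-exit check is what lets it run unattended: it stops the moment a model clears the bar instead of burning GPU hours on the full five iterations.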
The harness is capped at 5 iterations before stopping and generating a report. This run used all 5, taking 6 hours 35 minutes total, with each iteration taking roughly 1–1.5 hours. The evaluation suite has 20 test prompts across three categories: architecture knowledge (IPs, ports, what runs where), tool-call format (producing valid function calls), and operational knowledge (how to deploy, what the wake word is, how tasks are tracked).
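Scoring is straightforward once each prompt has a pass/fail grade. A minimal version, assuming the overall number is the plain pass rate across all 20 prompts (the real grader's weighting may differ):

```python
from collections import defaultdict

def score_eval(results):
    """results: list of (category, passed) pairs, one per test prompt.
    Returns per-category pass rates plus an overall pass rate."""
    by_cat = defaultdict(list)
    for category, passed in results:
        by_cat[category].append(passed)
    scores = {c: sum(v) / len(v) for c, v in by_cat.items()}
    scores["overall"] = sum(p for _, p in results) / len(results)
    return scores
```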
The Min-Tokens Bug
The 2.8% score made no sense given the volume of synthetic examples being generated. The architecture score stayed near zero across early runs no matter how many examples I fed it.
Digging in, I found the problem: format_dataset.py had a 200-token minimum filter. My architecture Q&A examples were short — "What IP does homelab-ai run on?" / "192.168.0.125" — maybe 50–80 tokens each. The filter was silently dropping every single architecture example. Over 1,000 generated examples were being thrown away before training ever saw them.
Changed the minimum to 50 tokens. Iteration 1 after the fix immediately jumped to 59.3% overall, a 56.5-point gain from fixing one constant in a filter.
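The shape of the bug is easy to reproduce. The sketch below is not format_dataset.py itself; it just takes examples with precomputed token counts and applies the threshold, showing how the old floor silently discarded every short architecture pair:

```python
OLD_MIN_TOKENS = 200
NEW_MIN_TOKENS = 50

def filter_examples(examples, min_tokens):
    """examples: list of (text, token_count) pairs, where token_count
    comes from the model's tokenizer. Anything under min_tokens is
    dropped with no warning, which is exactly what hid the bug."""
    return [text for text, n in examples if n >= min_tokens]

# A typical architecture Q&A pair runs 50-80 tokens.
qa = [("What IP does homelab-ai run on? 192.168.0.125", 62)]
```

With the 200-token floor, `filter_examples(qa, OLD_MIN_TOKENS)` returns an empty list; at 50, the pair survives. A log line counting dropped examples per category would have surfaced this in the first run.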
The Generation Quality Problem
Even with the filter fixed, the harness logs showed "+0 examples" added each iteration — the priority categories fix that would direct generation toward architecture, tool_calls, and operational hadn't been deployed yet for this run. The generation step was running but producing examples in categories the evaluator doesn't test, so the scores plateaued.
The synthetic data was also being generated by Llama 3.2 (3B) — the same class of model I was trying to train. A 3B model generating training data for a 7B model is the blind leading the blind. The examples were vague, sometimes wrong, and not detailed enough.
The fix: route synthetic generation through Claude via my terminal agent (the Raspberry Pi running the Claude Agent SDK). Claude produces dramatically better training examples — real SSH commands with correct IPs, multi-step deployment instructions, accurate port numbers, proper tool-call schemas.
Compare:
- Llama 3.2: "The IP address of the Mosquitto MQTT port is 192.168.0.120:1883. TASK_STATUS: COMPLETED"
- Claude: Full SSH commands, docker compose paths, multi-step verification, actual operational context
The difference in quality is stark. Llama gives you a sentence. Claude gives you a runbook.
Category Alignment
The "+0 examples" problem traced back to the harness generating synthetic data for whatever categories showed up in the dataset stats — including random categories like "management" and "accessibility" that the evaluator doesn't even test. Wasted GPU time training on data that doesn't improve the scores.
The fix is to define priority categories that match what the evaluator actually tests: architecture, tool_calls, operational, plus core skills like ansible, health_monitoring, ssh_diagnostics, and deployment. This fix wasn't deployed for the 5-iteration run captured in the training report, which is why the generation step showed no net gains per iteration despite running.
The Results: 62.3%
The best overall score across all 5 iterations was 62.3%, reached at iteration 4. The pass threshold is 70%, so the run ended without a passing grade — but the progression across iterations tells the real story.
Architecture was the big win: it went from 5% in the pre-fix runs, up to 20% at iteration 1, then 80.7% at iteration 3, and peaked at 86.3% by iteration 4. The min-tokens fix unlocked architecture learning almost entirely on its own.
Operational knowledge held steady at 63.3% across iterations — not improving, but not regressing either. It had enough examples to learn from.
Tool-calls is the clear bottleneck: it stayed stuck at 33.3% across all 5 iterations. The model learned the general shape of a function call but couldn't reliably produce the correct schema, argument names, or values for my specific tool definitions. No amount of additional training moved this number.
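The three failure modes (wrong function name, wrong argument names, missing required arguments) are all mechanically checkable against the tool schema. A sketch of such a validator, using a hypothetical `restart_service` tool in OpenAI function-calling style:

```python
import json

# Hypothetical tool definition; the real schemas are the homelab's own.
TOOL = {
    "name": "restart_service",
    "parameters": {
        "type": "object",
        "properties": {"host": {"type": "string"},
                       "service": {"type": "string"}},
        "required": ["host", "service"],
    },
}

def valid_tool_call(raw, tool=TOOL):
    """Check the three things the model kept getting wrong:
    the function name, the argument names, and required arguments."""
    try:
        call = json.loads(raw)
        args = call.get("arguments", {})
        if isinstance(args, str):       # some models emit a JSON string
            args = json.loads(args)
    except (json.JSONDecodeError, TypeError, AttributeError):
        return False
    props = tool["parameters"]["properties"]
    return (call.get("name") == tool["name"]
            and set(args) <= set(props)
            and set(tool["parameters"]["required"]) <= set(args))
```

A checker like this could also grade tool-call outputs during evaluation, or reject bad synthetic examples before they reach the training set.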
What's Next
62.3% isn't production-ready, and the 70% threshold exists for a reason — below that, the model makes enough mistakes that routing real infra queries to it would cause more problems than it solves. But the pipeline is validated and the bottleneck is identified.
Part 3 will focus entirely on cracking the tool_calls problem: 33.3% → 70%+. The plan is a dedicated tool-call prompt template that trains the model on the exact function signatures it will encounter in production, plus Claude-generated function-calling examples that include realistic argument values, multi-step tool chains, and error-recovery patterns. The architecture and operational scores are already in good shape — tool_calls is the one thing standing between this model and something genuinely useful.
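The core of that planned template is just rendering the real function signatures into every training prompt so the model memorizes exact schemas rather than paraphrases. A rough sketch of the idea (the prompt wording and tool names here are illustrative, not the final template):

```python
def tool_call_prompt(tools, user_query):
    """Render production function signatures into the training prompt.
    tools: list of OpenAI-style tool definitions."""
    sigs = "\n".join(
        f"- {t['name']}({', '.join(t['parameters']['properties'])})"
        for t in tools
    )
    return (f"You can call these functions:\n{sigs}\n\n"
            f"User: {user_query}\nRespond with a JSON function call.")
```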
The training harness runs overnight autonomously. I can sleep while my homelab teaches itself.