When fine-tuning makes sense
- Repeated prompts exceed context windows or waste tokens.
- You need consistent formatting across hundreds of files.
- You can supply thousands of high-quality supervised examples.
If occasional prompting solves the task, prefer skills (MCP & skills) or structured instructions first; they are cheaper to iterate on.
LoRA and QLoRA
Low-rank adapters (LoRA) train small matrices attached to frozen base weights—far less VRAM than full fine-tunes. QLoRA pushes footprints lower by quantizing the base model during training. Both require curated datasets and honest evaluation harnesses.
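The low-rank update itself is simple to sketch. The following is a minimal NumPy illustration of the LoRA forward pass (not a training loop, and all sizes are illustrative): the frozen weight W is augmented with a scaled product of two small trainable matrices, and with the standard zero initialization of the up-projection the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16   # illustrative dimensions and LoRA rank/scale

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    # base path plus scaled low-rank update: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# with B zero-initialized, output equals the frozen model's output
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B are updated during training, so the trainable parameter count is r * (d_in + d_out) per adapted matrix rather than d_in * d_out.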
Dataset hygiene
- Strip secrets before logging or sharing samples.
- Balance successful completions against failure-handling examples so the model learns safe refusals as well as happy paths.
- Version datasets like code—fine-tunes inherit whatever biases you encode.
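Secret-stripping can be automated as a pre-logging pass. A minimal sketch follows; the patterns here are purely illustrative, and a real pipeline needs a broader, audited rule set (and ideally an entropy-based scanner as a backstop).

```python
import re

# Illustrative patterns only -- not an exhaustive secret-detection rule set.
SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"(?i)(password|token)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]

def redact(text: str) -> str:
    """Apply each redaction pattern in turn before a sample is logged or shared."""
    for pattern, repl in SECRET_PATTERNS:
        text = pattern.sub(repl, text)
    return text

sample = "config: password=hunter2 and key sk-abcdefghijklmnopqrstuv"
print(redact(sample))
```

Running redaction at collection time, before samples touch disk, is safer than scrubbing a dataset after the fact.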
Evaluation
Hold out real-world tasks, measure regressions on general coding ability, and compare against the base model with identical prompts. Automatic metrics help, but human review on tricky tickets remains essential.
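The base-vs-tuned comparison can be structured as a tiny harness. In this sketch `run_model` is a hypothetical stand-in for a real inference call, and the held-out tasks and substring checks are illustrative; the point is that both models see identical prompts and are scored by the same automatic metric.

```python
# Hypothetical sketch: run_model stands in for a real inference call.
def run_model(name: str, prompt: str) -> str:
    canned = {
        ("base", "add"): "def add(a, b): return a + b",
        ("tuned", "add"): "def add(a, b): return a + b",
        ("base", "fmt"): "no changes",
        ("tuned", "fmt"): "formatted",
    }
    return canned[(name, prompt)]

# Held-out tasks mapped to a crude expected-substring check (illustrative metric).
HELD_OUT = {"add": "def add", "fmt": "formatted"}

def pass_rate(model: str) -> float:
    hits = sum(HELD_OUT[task] in run_model(model, task) for task in HELD_OUT)
    return hits / len(HELD_OUT)

base, tuned = pass_rate("base"), pass_rate("tuned")
print(f"base={base:.2f} tuned={tuned:.2f} regression={tuned < base}")
```

A regression flag like `tuned < base` is a gate, not a verdict; disagreements between the two models are exactly the cases to route to human review.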
Serving adapters
After training, export artifacts in a format your inference stack understands (PEFT adapters merged into weights, GGUF conversions, etc.). Host via vLLM, SGLang, or another OpenAI-compatible gateway, then register the endpoint in DeepSeek TUI—same flow as Local deployment.
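Once the endpoint is OpenAI-compatible, any client speaks the same chat-completions JSON to it. A minimal sketch of the request body follows; the base URL reflects vLLM's default port, and the model name is a hypothetical placeholder for your merged or adapter-backed model.

```python
import json

# Assumptions: adjust base_url and model to match your own deployment.
base_url = "http://localhost:8000/v1"   # vLLM's OpenAI-compatible server default
payload = {
    "model": "my-org/base-model-lora",  # hypothetical served model name
    "messages": [{"role": "user", "content": "Summarize this diff."}],
    "temperature": 0.2,
}
# Standard OpenAI chat-completions request body, POSTed to {base_url}/chat/completions:
print(json.dumps(payload, indent=2))
```

Registering `base_url` as a custom endpoint in DeepSeek TUI then follows the same flow as Local deployment.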
Compliance
Respect upstream licenses and corporate policies. Some checkpoints permit derivative work only under specific terms—read each card before shipping adapters to production.
Related
- Model families
- API pricing if you hybridize cloud + local.
- Configuration for routing custom endpoints.