RAG solutions seem to have their limitations, and fine-tuning might be a more effective approach.
How much effort is required to turn code into something one can use for fine-tuning?
Fine tuning against in-house code seems like a small gain over a base model and search. It's unlikely your code is unique, special, and big enough that it's hard to get results from a base model. You'll be pinned to a certain version of a certain model, and you won't be able to upgrade to future models nearly as quickly. Of course, you're also fighting time: every commit changes the code, so the model drifts out of date unless you continually fine-tune it.
A RAG setup might still struggle with a super vague question like "where does foo call bar with baz set", but it's unlikely fine-tuning would handle that any better. This is where static code search by symbols really should be used.
Tbh the hardest part is the lifecycle - ie new data, updating, serving etc - that seems to be the biggest issue
I've done it. 1/2 the team thought it was great 20% of the time, 1/2 the team hated it from day 0. I used roughly 500K lines of code.
> How much effort is required to turn code into something one can use for fine-tuning?
Very little to moderate: less than 200 lines of Python, Qwen FIM, HF, llama.cpp, and the llama.cpp code extension.
> RAG solutions seem to have their limitations, and fine-tuning might be a more effective approach.
The only problem either way is keeping the information up to date, RAG just adds more cost to the inference process (which at my dev speed is pretty important).
> How much effort is required to turn code into something one can use for fine-tuning?
Fine tuning "fill in the middle" process is the process of taking a file, cutting out a some text in the middle and asking AI to guess what was there - there is a hugging face example that will have you doing it in an hour or less - your OPs team saying "No you cant litreally copy all code to a single folder" is probably the biggest hurdle (advise them you'll do it in CI and then they can stand up a FIM training endpoint that accepts a csv, pretty easy)
I know it's coming but "mUlTi GpU PlZ" :pleading: <3
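To make the FIM step concrete, here's a minimal sketch of turning a checkout into training samples, assuming Qwen-style FIM special tokens and the CSV-per-sample layout mentioned above (the token strings and file layout are assumptions, not necessarily what the commenter used):

    import csv
    import random
    from pathlib import Path

    # Qwen-style FIM special tokens; check your model card, other FIM-trained
    # models use different token strings.
    PREFIX, SUFFIX, MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

    def make_fim_sample(source: str, span: int = 256) -> str:
        """Cut a random chunk out of the middle of a file and rebuild it as a
        prefix/suffix/middle training string."""
        if len(source) <= 2 * span:
            return PREFIX + source + SUFFIX + MIDDLE  # file too small to split
        start = random.randrange(span, len(source) - span)
        end = start + span
        return (PREFIX + source[:start] + SUFFIX + source[end:]
                + MIDDLE + source[start:end])

    # Walk the repo and dump one FIM sample per file into a CSV, which is what
    # the training endpoint described above could accept.
    with open("fim_samples.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["text"])
        for path in Path(".").rglob("*.py"):
            writer.writerow([make_fim_sample(path.read_text(errors="ignore"))])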
Is it just a matter of assembling Q/A pairs like: “What’s class X?”, “class X { … }”
Do you really need to do this training on the base model instead of the chat model, which means you have to fine-tune chat back in afterward?
How does this work?
You will generally get better results when you fine-tune the base model on your data.
Since you still want to use it with the chat template in the end, you fine-tune the base model with the chat template with your specific data.
From there you'll have a lora that knows your data alright, but still doesn't really work for chatting.
You take that lora, merge it with the base model. Let's call this the stage model.
Then you use mergekit to merge the base model with both the stage model and the chat model. I used the TIES merge method in the past. Now you have your final model.
I use vLLM for inference, and needed access to multiple fine-tunes on only a single set of hardware. So from that point I go and take the base model and my final model and extract a new lora. I also take the base model and chat model and extract another lora for that. Then I load up vLLM with the base model and as many of the fine-tune loras as I need, plus the chat lora.
The only time this hasn't worked is if the chat model adds a bunch of new tokens on top of the base model. If I remember right, there was an issue with that.
This has worked well for me in the past.
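For anyone curious what the serving side of that looks like, here is a minimal sketch of multi-LoRA inference with vLLM; the model path and adapter names/paths are placeholders, and each request selects one adapter:

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # Base model plus however many extracted adapters you need (placeholder paths).
    llm = LLM(model="path/to/base-model", enable_lora=True, max_loras=4)
    params = SamplingParams(max_tokens=256, temperature=0.2)

    # One LoRARequest per adapter: the chat lora and the per-codebase fine-tune loras.
    chat_lora = LoRARequest("chat", 1, "loras/chat")
    project_lora = LoRARequest("project-a", 2, "loras/project-a")

    # Each generate call picks whichever adapter fits the request.
    out = llm.generate(["How does the billing service retry failed jobs?"],
                       params, lora_request=project_lora)
    print(out[0].outputs[0].text)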
A lot of financial, legal, and health companies do fine-tuning! Reasoning fine-tuning via GRPO is also very powerful, since you don't need any CoT data in between - just inputs and outputs!
If you have a Mac, you can also do pretty well training LoRA adapters using something like Llama-Factory, and allowing it to run overnight. It's slower than an NVIDIA GPU, but the increased effective memory size (if you, say, have 128GB) can allow you more flexibility.
A 'LoRA' is a memory-efficient type of fine tuning that only tunes a small fraction of the LLM's parameters. And 'quantisation' reduces an LLM to, say, 4 bits per parameter. So it's feasible to fine-tune a 7B parameter model at home.
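As a concrete sketch of what "LoRA plus 4-bit quantisation" looks like with Hugging Face peft and bitsandbytes (the model name is just an example, and this particular 4-bit path needs an NVIDIA GPU - on a Mac you'd go through something like Llama-Factory or MLX instead):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Quantisation: load a 7B-class model in 4-bit so it fits in modest VRAM.
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)

    model_name = "mistralai/Mistral-7B-v0.1"  # any 7B-class base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name,
                                                 quantization_config=bnb,
                                                 device_map="auto")

    # LoRA: only train small adapter matrices on the attention projections.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of total params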
Anything bigger than 7B parameters and you'll want to look at renting GPUs on a platform like Runpod. In the current market, there are used 4090s selling on eBay right now for $2,100, while Runpod will rent you a 4090 for $0.34/hr - you do the math (that's roughly 6,200 hours, or about eight and a half months of non-stop rental, before buying pays off).
It's certainly possible to scale model training to span multiple nodes, but generally scaling through bigger GPUs and more GPUs per machine is easier.
But if it's helpful, I was thinking about spinning up a platform for something like that!
It'd become a lot less practical with huge datasets, but I'd guess that a lot of fine tuning tasks aren't really that large.
On paper fine tuning smaller models can greatly reduce the cost for a specific task, but I've not heard many real-world success stories around that.
I think vision LLMs are one of the most interesting applications here - things like fine-tuning for better results extracting data from a specific paper form or report structure. Again, not many public examples of that.
1. Codebases, docs, large corpora of internal datasets - fill in the middle, auto completion, etc.
2. I know a tonne of financial institutions use fine-tuning for trading, real-time data parsing, headline analysis, signal creation, etc.
3. Distillation is also relatively common - taking outputs of a large model and distilling it to a small model
4. Increasing accuracy is the most important thing - not cost or latency - we find that if you solve the fine-tuning life cycle, i.e. continuous auto fine-tuning, data filtering, and reinforcement learning via DPO, that works well!
5. Lots of organizations use DPO and preference fine-tuning to align models since they have tonnes of feedback data!
6. Yep, vision fine-tuning! E.g. medical diagnosis, docs, QA on pics, etc.
7. And obviously large model labs fine-tune all base models, i.e. GPT-4.5 is a fine-tune of a base model
8. Finally, reasoning fine-tuning via GRPO is very cool! If you have inputs and outputs but no labelled CoT in between, GRPO is the way to go! Companies write custom reward functions - see the sketch after this list!
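For a feel of what a custom reward function looks like, here is a rough sketch using TRL's GRPOTrainer; the model, dataset, and reward check are placeholders, not anyone's production setup:

    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer

    # Only inputs and reference outputs, no labelled chain-of-thought.
    train_dataset = Dataset.from_list([
        {"prompt": "What is 12 * 7?", "answer": "84"},
        {"prompt": "What is 15% of 200?", "answer": "30"},
    ])

    def reward_exact_answer(completions, answer, **kwargs):
        # Reward 1.0 when the sampled completion contains the reference answer.
        return [1.0 if ref in text else 0.0 for text, ref in zip(completions, answer)]

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder small model
        reward_funcs=reward_exact_answer,
        args=GRPOConfig(output_dir="grpo-out"),
        train_dataset=train_dataset,
    )
    trainer.train()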
I still haven't seen a convincing demo of using fine-tuning to "teach" a model new information from additional documents. I'd love to see one.
(Closest I've come to that is I heard a rumor that Jane Street have fine-tuned an LLM for OCaml)
https://huggingface.co/TrevorJS/check-amount-deverbalizer-sm...
At Avy.ai we're running small (2B-7B, quantized) vision models as part of a Mac desktop application for understanding what someone is working on in the moment, to offer them related information and actions.
We found that the raw results in understanding the images with a light LoRA fine-tune are not substantially different -- but the ease of getting a small model to follow instructions, outputting structured data in response to the image at the level of verbosity and detail we need, is greatly enhanced by fine-tuning. Without fine-tuning, the models on the smaller end of that scale would be much more difficult to use, not reliably producing output that matches what the consuming application expects.
The bigger thing, though, was getting the models to produce the appropriate level of verbosity and detail in their output, which fine-tuning made more consistent.
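(Not their code, just an illustration.) The "output the consuming application expects" part usually boils down to a schema check like this minimal sketch, where the ScreenContext fields are hypothetical:

    import json
    from pydantic import BaseModel, ValidationError

    # Hypothetical schema for what the desktop app expects back from the
    # vision model's description of what the user is working on.
    class ScreenContext(BaseModel):
        application: str
        task_summary: str
        suggested_actions: list[str]

    def parse_model_output(raw: str) -> ScreenContext | None:
        """Accept the model's reply only if it is valid JSON matching the schema;
        a fine-tuned small model passes this far more often than the raw one."""
        try:
            return ScreenContext.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError):
            return None  # caller can retry or fall back to a larger model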
Could be a useful marketing strategy for you, given how starved we all are of information about successful fine tuning stories.
I got to present at GCP Next about a part of this last year: https://www.youtube.com/watch?v=5QsM1K9ahtw
I’m presenting in one (and maybe two) sessions with more info on the training side this year.
In practice, prompt engineering and few-shot prompting with modern LLMs tends to be more pragmatic, given their strong (and only getting better over time) prompt adherence.
When it's more feasible to do inference on the client (browser or desktop), I can see SLMs popping up more commonly in production.
It's not actually that expensive or hard. For narrow use cases, you can produce 4-bit quantized fine-tunes that perform as well as the full model. Hosting the 4-bit quantized version can be done at relatively low cost: you can use an A40 or RTX 3090 on Runpod for ~$300/month.
If you want to scale up and down on demand, you can just fine-tune on OpenAI or Google Cloud as well.
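For the hosted route, kicking off a job is roughly this much code (a sketch against the OpenAI fine-tuning API; the file name and base model are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Upload a JSONL file of chat-format training examples, one
    # {"messages": [...]} object per line.
    training_file = client.files.create(
        file=open("train.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Kick off the hosted fine-tuning job against a placeholder base model.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",
    )
    print(job.id, job.status)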
I don't think that's true.
I can fine tune a model by renting a few A100s for a few hours, total cost in the double digit dollars. It's a one-time cost.
Running inference with the resulting model for a production application could cost single digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
That may or may not be true for use-cases that require asynchronous, bulk inference _and_ require some task-specific post-training.
FWIW, my approach towards tasks like the above is to
1. start with an off-the-shelf LM API, until
2. one figures out (using evals that capture product intent) what the failure modes are (there always are some), and then
3. post-train against those (using the evals) - a minimal sketch of step 2 follows.
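A minimal sketch of that eval loop, assuming an OpenAI-compatible API and trivially checkable cases (both the cases and the check are placeholders for whatever actually captures product intent):

    from openai import OpenAI

    client = OpenAI()

    # Tiny eval set; in reality these come from logged traffic and product
    # requirements, and the checks are task-specific.
    EVAL_CASES = [
        {"input": "Extract the due date from: 'pay by 2024-06-30'.",
         "must_contain": "2024-06-30"},
        {"input": "What currency is '1.234,56 EUR' in?", "must_contain": "EUR"},
    ]

    def run_evals(model: str) -> list[dict]:
        """Run every case against the off-the-shelf model and collect failures;
        the failure set is what you later post-train against."""
        failures = []
        for case in EVAL_CASES:
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["input"]}],
            ).choices[0].message.content
            if case["must_contain"] not in reply:
                failures.append({"case": case, "reply": reply})
        return failures

    print(run_evals("gpt-4o-mini"))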
So, unless you hope to stay at the forefront (e.g. to be ahead of competitors), there has been no real reason to fine-tune for the last 4 years. At best you could hope to stay about 1-3 months ahead, depending on how fast you were at setting up your training. And if that is what you did hope to achieve, you needed to automate on a higher level, i.e. automate data collection and the collection of eval cases.
Arabic OCR is a mess with historical texts. Take the word الف (alf/thousand) in dates like 1950 - in old documents, the ف (fa) had a dot below it, but modern OCR doesn't get this and outputs الد (alad), which is just gibberish in Arabic
Same problem with ق (qaf) written as ف (fa) in old Arabic
And don't get me started on merged letters! In محمد (Muhammad), sometimes the م (meem) sits right on top of the ح (haa), or appears as a little circle below the line. Modern OCR has no clue what to do with these
My solution? Run OCR first, then use LLMs to fix the mess based on context. The surprising part? In my tinkering, smaller fine-tuned models actually do BETTER at this specific task than the big general-purpose ones. They seem to learn the patterns of historical Arabic quirks more effectively. Pretty neat tradeoff of specialized knowledge vs. general intelligence
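Not the commenter's code, but the "OCR first, then LLM cleanup" pipeline they describe looks roughly like this; the corrector model name is hypothetical, standing in for whatever small model you fine-tuned:

    import pytesseract
    from PIL import Image
    from transformers import pipeline

    # Step 1: conventional OCR on the scanned page, using the Arabic language pack.
    raw_text = pytesseract.image_to_string(Image.open("scan_page_12.png"), lang="ara")

    # Step 2: a small fine-tuned model repairs the OCR output from context
    # (dots below fa, qaf/fa confusion, stacked letters in words like Muhammad).
    corrector = pipeline("text-generation", model="my-org/arabic-ocr-corrector-1.5b")

    prompt = (
        "The following is noisy OCR output from a historical Arabic document. "
        "Rewrite it with the OCR errors corrected, changing nothing else:\n\n"
        + raw_text
    )
    corrected = corrector(prompt, max_new_tokens=512)[0]["generated_text"]
    print(corrected)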
Both Phi-4-mini and Gemma 3 were released recently. Phi-4's damn close to a good, real, model release. Microsoft's done a great job of iterating.
Gemma 3's an excellent, intelligent, model, but it's got a gaping blind spot: tool-calling / JSON output. There was a vague quick handwave about it in some PR, a PM/eng on the Gemma team commented here in response to someone else that TL;DR "it's supported in Ollama!", which is Not Even Wrong, i.e. in the Pauli sense of the phrase.
- Ollama uses a weak, out of date llama.cpp thing where the output tokens are constrained to match a JSON schema. This falls apart almost immediately, i.e. as soon as there is more than one tool.
- The thing that matters isn't whether we can constrain output tokens - any model can do that; I've had Llama 3 1B making tool calls that way. The thing that matters is A) did you train that in, and B) if you did, tell us the format
All that to say, IMHO we're still 6 months to a year out from BigCo understanding enough about their own stuff to even have a good base for it. Sure, tool calling and fine-tuning are orthogonal, in a sense, but in practice, if I'm interested in getting a specific type of output, odds are I wanted that formatted a specific way.
It can't understand numbers very well though: "one thousand five" might become "1500".
JSON constraints seem to make them unable to figure it out even if they'd normally get it every time.
Maybe it's different with models above 4B though.
found this on HF https://huggingface.co/ZySec-AI/gemma-3-27b-tools
we are optimizing these on different dimensions at once, with multiple branches of evolution from each model
so a successor version name doesn't really convey that
I'm particularly interested in this aspect because we're considering fine-tuning Gemma 3, but our budget is tight. We're looking into (real-world) cost estimates for this approach.
My understanding is that they don't charge for these by themselves, although you might have to pay a Colab fee to Google.
They charge for higher-end models, it seems: https://unsloth.ai/pricing
If yes, what are they good and bad at?