To finetune or not to finetune

Teaching Chatbots new tricks: our learnings from Finetuning AI Models.

We’ve been running some experiments recently with finetuning large language models (LLMs). Finetuning is when you further train an existing model on a new dataset to adapt its behaviour for a specific purpose.

We explored different methods (axolotl, TinyLlama, llama.cpp) and different base models (“tinyllama-1.1b-chat”, “llama-2-7b”, “llama-160m”, “Flash-Llama-30M”, “TinyMistral-248M”) to see what works best. Here’s a quick summary of our key learnings:

Chat Models vs Text Models

Text models seem to learn better from a “completion” dataset ({"text": ""}). On chat models, I’m not sure the finetuning had any effect on the responses, as they probably require a “conversation” dataset ({"input": "", "output": ""}) for training.
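As a sketch of the two dataset shapes described above (the record contents and the helper function are our own illustration, not any specific training library's format):

```python
import json

# Completion-style record: plain text for the model to continue.
# This is the {"text": ""} shape that text (base) models train on.
completion_record = {"text": "An FCTO is not just a part-time CTO."}

# Conversation-style record: a paired input/output turn.
# This is the {"input": "", "output": ""} shape chat models expect,
# so the chat template can be applied during training.
conversation_record = {
    "input": "What is an FCTO?",  # hypothetical example prompt
    "output": "An FCTO is not just a part-time CTO.",
}

def to_jsonl(records):
    """Serialise records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

print(to_jsonl([completion_record, conversation_record]))
```

Training datasets are commonly stored as JSONL files like this, one record per line, which is what tools such as axolotl consume.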

Small Models Learn Faster

Large models need more data and compute time to finetune than smaller ones. Large models are good for general chat, but small models adapt more quickly to specialised use cases. The trade-off is that small models tend to hallucinate more.

Training Data Length vs Expected Output

Training with a large context window takes significantly longer. I think that is only worth it if we also want long answers from the model. The items in the training dataset should have an average length roughly matching the context we expect in future conversations: any tokens beyond the context length specified during training are simply ignored.
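A rough way to pick the training context size is to match it to the average item length. A minimal sketch (the helper name, the 4-characters-per-token estimate, and the candidate sizes are all assumptions, not measured values):

```python
# Suggest a training context size (--ctx) from the dataset itself:
# anything past the configured context is ignored during training,
# so a size near the average item length avoids wasted compute.

def suggest_ctx(items, chars_per_token=4, sizes=(256, 512, 1024, 2048)):
    """Return the smallest candidate context covering the average item.

    chars_per_token is a crude English-text estimate; real token counts
    depend on the model's tokenizer.
    """
    avg_tokens = sum(len(text) for text in items) / len(items) / chars_per_token
    for size in sizes:
        if size >= avg_tokens:
            return size
    return sizes[-1]  # dataset items are longer than any candidate

dataset = [
    "Short answer example.",
    "A somewhat longer answer that spans a sentence or two of text.",
]
print(suggest_ctx(dataset))  # → 256
```

For a real run you would count tokens with the model's own tokenizer instead of estimating from characters, but the principle is the same.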

Hardware Considerations

An M1 Max with a 10-core CPU can finetune a small model (up to 1B parameters) for 2 epochs in a couple of hours. Bigger models would require CUDA GPUs to train in a reasonable time. Some projects are exploring parallel/distributed training, but that requires an even more complex setup.


Finetuning requires more data and resources than Prompt engineering, but can teach chatbots more subtle, intrinsic behaviours. For most use cases I think we should still use Prompt engineering. Finetuning is only relevant if there is a very specific way of speaking that cannot be requested with a prompt, or if a large (1,000+ examples) dataset is already available.

In summary, Finetuning is necessary when you want to change the intrinsic behaviours of a chatbot – the style, format and structure of responses, or the use of specific vocabulary and figures of speech. Prompt engineering is still the best way to change what the model answers – for example, with up-to-date or private information. Prompt engineering can tell a model what to say, but Finetuning teaches it how to say it.

Nerd zone

Here are 2 example commands we used to train a model using llama.cpp on an M1 Max:

# How to run finetuning on llama.cpp
$ ../finetune \
    --model-base ../models/Locutusque_TinyMistral-248M/ggml-model-f32.gguf \
    --train-data ../train-data/dataset.txt \
    --threads 10 \
    --batch 4 \
    --sample-start "<s>" \
    --ctx 512 \
    --epochs 4 \
    --checkpoint-in ./checkpoint-LATEST.gguf
# How to run inference on a finetuned model with llama.cpp
$ ../main \
    --model ../models/Locutusque_TinyMistral-248M/ggml-model-f32.gguf \
    --lora ./TinyMistral-248M/ggml-lora-LATEST-f32.gguf \
    --temp 0 \
    --prompt "An FCTO is not just a part-time Chief Technology Officer"