Don't stop at Supervised Fine-Tuning (SFT). SFT teaches a model *how* to respond in a certain format. Follow it with Direct Preference Optimization (DPO) to teach the model *what* constitutes a good response, using preference pairs to correct undesirable behaviors like fabrication or verbosity.
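As a concrete illustration, here is a minimal DPO sketch assuming the Hugging Face TRL library. The model name and the preference pair are placeholders, and the exact `DPOTrainer` argument names vary between `trl` releases.

```python
# Minimal DPO sketch with Hugging Face TRL (API details differ across trl versions).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-org/sft-checkpoint"  # hypothetical: start from your SFT model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: a prompt, the preferred answer, and a rejected answer
# (e.g. one that fabricates details or rambles).
pairs = Dataset.from_dict({
    "prompt": ["In what year did Apollo 11 land on the Moon?"],
    "chosen": ["Apollo 11 landed on the Moon in 1969."],
    "rejected": ["Apollo 11 landed in 1972 after a long series of delays."],  # fabricated
})

training_args = DPOConfig(output_dir="dpo-out", per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,                 # the reference model defaults to a frozen copy
    args=training_args,
    train_dataset=pairs,
    processing_class=tokenizer,  # called `tokenizer` in older trl releases
)
trainer.train()
```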
When fine-tuning a model for question-answering, tokenize the question and the answer separately, then mask the question tokens out of the loss calculation (for example by setting their labels to the ignore index). This concentrates the model's learning on generating correct answers rather than on reproducing the question, improving training efficiency and focus.
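A short sketch of that masking, assuming a Hugging Face tokenizer and PyTorch-style labels where `-100` is the ignore index; the tokenizer choice and `build_example` helper are illustrative only.

```python
# Loss masking for QA fine-tuning: question tokens get label -100, so cross-entropy
# is computed only over the answer tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

def build_example(question: str, answer: str, max_len: int = 512):
    # Tokenize question and answer separately so we know where the answer starts.
    q_ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    a_ids = tokenizer(answer + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

    input_ids = (q_ids + a_ids)[:max_len]
    # -100 is PyTorch's ignore_index: question tokens contribute nothing to the loss,
    # so learning concentrates on producing the answer.
    labels = ([-100] * len(q_ids) + a_ids)[:max_len]
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": [1] * len(input_ids),
    }

example = build_example("Q: What is the capital of France?\nA:", " Paris.")
```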
When using Parameter-Efficient Fine-Tuning (PEFT) with LoRA, apply the adapters to all linear layers rather than only the attention projections; this yields models that reason noticeably better. It moves beyond simply mimicking the style of the training data and produces deeper improvements in the model's capabilities.
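A minimal configuration sketch using the `peft` library; the base model name is a placeholder, and the `"all-linear"` shortcut requires a recent `peft` release (otherwise list the target module names explicitly).

```python
# LoRA applied to all linear layers, not just the attention projections.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my-org/base-model")  # hypothetical base

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # attach adapters to every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirm only the adapter weights are trainable
```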
Standard automated metrics like perplexity and loss measure a model's statistical confidence, not its ability to follow instructions. To properly evaluate a fine-tuned model, establish a curated "golden set" of evaluation samples to manually or programmatically check if the model is actually performing the desired task correctly.
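A simple programmatic pass over such a golden set might look like the sketch below; the model name, the sample prompt, and the substring check are all placeholders to swap for your own task-specific grading (e.g. a structured parser or an LLM judge).

```python
# Golden-set check: run each prompt through the fine-tuned model and score the output
# against a reference with a task-specific check (a naive substring match here).
from transformers import pipeline

generator = pipeline("text-generation", model="my-org/fine-tuned-model")  # hypothetical

golden_set = [
    {"prompt": "Extract the invoice total from: 'Total due: $41.20'", "expected": "$41.20"},
    # ... more curated cases covering the behaviors you care about
]

def evaluate(cases):
    passed = 0
    for case in cases:
        output = generator(case["prompt"], max_new_tokens=64)[0]["generated_text"]
        if case["expected"] in output:  # replace with a grader suited to your task
            passed += 1
    return passed / len(cases)

print(f"golden-set pass rate: {evaluate(golden_set):.0%}")
```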
