Despite the success of Instruction Tuning (IT) in training large language models (LLMs), such models often leverage spurious or biased features learnt from their training data and can become misaligned, leading to undesired behaviours. While existing techniques can steer model behaviour at inference time, they are often post-hoc and do not embed steering as an intrinsic model capability. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses on specified features, focusing on some whilst ignoring others, so that behaviour changes according to which features are specified. Across diverse benchmarks, we demonstrate that FIT: (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features. FIT therefore offers a lightweight, intrinsic mechanism for building more robust, fair, and easily controllable LLMs.
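As a rough illustration of the training setup described above, the sketch below (not the authors' released code; the dataset fields, `FocusExample`, and `build_focus_prompt` are hypothetical) constructs focus-conditioned training pairs in which the target response follows the focused feature: focusing on the core feature yields the ground-truth label, while focusing on a spurious feature yields the label that feature alone would predict.

```python
# Minimal sketch of FIT-style data construction, under the assumption that
# each record carries a core (causal) feature, a spurious feature, and the
# label implied by each. All names here are illustrative, not the paper's API.
from dataclasses import dataclass
import random

FOCUS_TEMPLATES = [
    "Focus on the {focus} and ignore the {ignore} when answering.",
    "Base your answer only on the {focus}; do not use the {ignore}.",
]


@dataclass
class FocusExample:
    prompt: str   # focus instruction followed by the task input
    target: str   # response consistent with the focus specification


def build_focus_prompt(task_input: str, focus: str, ignore: str) -> str:
    """Prepend a natural-language focus instruction to the task input."""
    instruction = random.choice(FOCUS_TEMPLATES).format(focus=focus, ignore=ignore)
    return f"{instruction}\n\n{task_input}"


def make_fit_examples(record: dict) -> list[FocusExample]:
    """Create one training example per focus specification for a single record."""
    return [
        # Focus on the core feature -> target is the ground-truth label.
        FocusExample(
            prompt=build_focus_prompt(
                record["input"], record["core_feature"], record["spurious_feature"]
            ),
            target=record["core_label"],
        ),
        # Focus on the spurious feature -> target is the label it predicts.
        FocusExample(
            prompt=build_focus_prompt(
                record["input"], record["spurious_feature"], record["core_feature"]
            ),
            target=record["spurious_label"],
        ),
    ]


if __name__ == "__main__":
    record = {
        "input": "Review: 'Great plot, but the DVD arrived scratched.' Sentiment?",
        "core_feature": "sentiment of the review text",
        "spurious_feature": "mention of shipping damage",
        "core_label": "positive",
        "spurious_label": "negative",
    }
    for ex in make_fit_examples(record):
        print(ex.prompt, "->", ex.target, "\n")
```

Pairs like these would then be fed to an ordinary supervised fine-tuning loop, so that at inference time the model responds according to whichever feature the user asks it to focus on.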
Large language models (LLMs), like those powering chatbots, are trained to follow instructions. However, further training such models to solve specific tasks can sometimes cause them to overlook important goals such as safety or fairness. For instance, if a model is fine-tuned to be more helpful, it might start providing sensitive information or inappropriately reinforcing social biases. While some existing techniques can adjust how a model acts after training, these methods are often complicated, difficult to use, and don’t let everyday users guide the model with plain natural language. To address this, we propose **Focus Instruction Tuning (FIT)**: a new technique that lets users adaptively control a model’s responses using simple, natural-language instructions at the time of use. For example, a user could include the instruction “Answer this question without using gender stereotypes” directly in their prompt, and the model will adapt its response accordingly, helping to correct misalignments such as gender bias that can result from training. Our findings show that FIT makes language models more flexible and dependable: they can follow new focus instructions across a wide range of situations, give less biased answers, and continue to work well even when faced with new topics or unexpected changes in data.
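To make the usage pattern concrete, here is a minimal, hypothetical inference-time example using the Hugging Face `transformers` text-generation pipeline. The model path is a placeholder for a FIT-tuned checkpoint, and the question is an invented illustration; only the focus instruction is taken from the summary above.

```python
# Illustrative inference-time usage of a FIT-tuned model (placeholder path,
# not an official release): the user states what to ignore in plain language.
from transformers import pipeline

generator = pipeline("text-generation", model="path/to/fit-tuned-model")

question = (
    "A nurse and a doctor walked in. One of them gave the patient an "
    "injection. Who most likely gave the injection?"
)
focus_instruction = "Answer this question without using gender stereotypes."

# The focus instruction is simply prepended to the user's question.
prompt = f"{focus_instruction}\n\n{question}"
output = generator(prompt, max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"])
```

Because the steering signal is just natural language in the prompt, no extra decoding machinery or post-hoc intervention is needed at deployment time.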
```bibtex
@article{lamb2024focus,
  title   = {Focus On This, Not That! Steering LLMs With Adaptive Feature Specification},
  author  = {Lamb, Tom A and Davies, Adam and Paren, Alasdair and Torr, Philip HS and Pinto, Francesco},
  journal = {arXiv preprint arXiv:2410.22944},
  year    = {2024}
}
```