Some Notes on the InstructGPT Paper and RLHF
Training language models to follow instructions with human feedback
Introduction
- use reinforcement learning from human feedback to fine-tune GPT-3 to follow a broad class of written instructions
- InstructGPT models show improvements in truthfulness over GPT-3
- InstructGPT shows small improvements in toxicity over GPT-3, but not bias
- InstructGPT models show promising generalization to instructions outside of the RLHF finetuning distribution
Methods and experimental details
High-level methodology
- Collect demonstration data, and train a supervised policy
- Collect comparison data, and train a reward model
- Optimize a policy against the reward model using PPO
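Putting the three stages together, a rough end-to-end skeleton might look like the sketch below. The stage functions (`train_sft`, `train_reward_model`, `ppo_finetune`) are illustrative stand-ins for the procedures described in the paper, not a real API.

```python
# Illustrative skeleton of the three-stage RLHF pipeline; the stage functions
# are stand-ins for the procedures described in the paper, not a real API.

def train_sft(pretrained_lm, demonstrations):
    """Stage 1: supervised fine-tuning on labeler demonstrations."""
    ...  # standard next-token cross-entropy on (prompt, demonstration) pairs
    return pretrained_lm

def train_reward_model(sft_model, comparisons):
    """Stage 2: fit a scalar reward model on human comparisons."""
    ...  # pairwise ranking loss over (chosen, rejected) responses
    return sft_model

def ppo_finetune(sft_model, reward_model, prompts):
    """Stage 3: optimize the policy against the reward model with PPO."""
    ...  # PPO with a KL penalty toward the SFT model
    return sft_model

def rlhf_pipeline(pretrained_lm, demonstrations, comparisons, prompts):
    sft_model = train_sft(pretrained_lm, demonstrations)
    reward_model = train_reward_model(sft_model, comparisons)
    return ppo_finetune(sft_model, reward_model, prompts)
```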
Models
- Supervised fine-tuning (SFT)
- We fine-tune GPT-3 on our labeler demonstrations using supervised learning
- Reward modeling (RM)
- Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward (a sketch of this setup follows this list)
- Reinforcement learning (RL)
- fine-tuned the SFT model on our environment using PPO
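As a rough illustration of the reward-modeling step above, the sketch below puts a scalar value head on top of a transformer backbone and trains it with the pairwise comparison loss $-\log \sigma(r_{\theta}(x, y_w) - r_{\theta}(x, y_l))$ used in the paper. `RewardModel`, `value_head`, and `pairwise_loss` are illustrative names, and the backbone is assumed to return hidden states of shape (batch, seq_len, hidden).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch of a scalar reward head on top of a transformer backbone.

    The backbone is assumed to return hidden states of shape
    (batch, seq_len, hidden_size); in the paper, the backbone is the SFT
    model with its final unembedding layer removed.
    """

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)  # scalar reward output

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)                       # (B, T, H)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)    # one reward per sequence


def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```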
Evaluation
- the model should follow instructions, but also infer intention from a few-shot prompt or another interpretable pattern
- Evaluations on API distribution
- human preference ratings on a held-out set of prompts from the same source as our training distribution
- Evaluations on public NLP datasets
Discussion
Implications for alignment research
- The cost of increasing model alignment is modest relative to pretraining.
- We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in
- We were able to mitigate most of the performance degradations introduced by our fine-tuning.
- We’ve validated alignment techniques from research in the real world
Open questions
- Many methods could be tried to further decrease the models’ propensity to generate toxic, biased, or otherwise harmful outputs.
- Getting models to do what we want is directly related to the steerability and controllability literature
- there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results
Illustrating Reinforcement Learning from Human Feedback (RLHF)
RLHF: Let’s take it step by step
Pretraining language models
- In general, there is no clear answer on which model is the best starting point for RLHF
- the design space of options in RLHF training is not thoroughly explored
Reward model training
- The underlying goal
- get a model or system
- takes in a sequence of text
- returns a scalar reward
- which should numerically represent the human preference
- The training dataset of prompt-generation pairs
- generated by sampling a set of prompts from a predefined dataset
- OpenAI used prompts submitted by users to the GPT API
- Human annotators are used to rank the generated text outputs from the LM
- rankings of multiple outputs, rather than raw scalar scores, produce a much better-regularized dataset (a sketch for expanding rankings into comparison pairs follows below)
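A minimal sketch of how such comparison data is commonly prepared: a labeler ranking of $K$ responses can be expanded into $K(K-1)/2$ (chosen, rejected) pairs for the reward model. `ranking_to_pairs` is a hypothetical helper, not from the source.

```python
from itertools import combinations
from typing import Dict, List

def ranking_to_pairs(prompt: str, ranked_responses: List[str]) -> List[Dict[str, str]]:
    """Expand one human ranking (best first) into pairwise comparisons.

    A ranking of K responses yields K*(K-1)/2 (chosen, rejected) pairs,
    which is the form of data the reward model is trained on.
    """
    pairs = []
    for better_idx, worse_idx in combinations(range(len(ranked_responses)), 2):
        pairs.append({
            "prompt": prompt,
            "chosen": ranked_responses[better_idx],
            "rejected": ranked_responses[worse_idx],
        })
    return pairs

# Example: a ranking of 3 responses produces 3 comparison pairs.
print(len(ranking_to_pairs("Explain RLHF.", ["a", "b", "c"])))  # 3
```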
Fine-tuning with RL
- Proximal Policy Optimization (PPO)
- Formulation
- policy
- a language model that takes in a prompt and returns a sequence of text
- action space
- all the tokens corresponding to the vocabulary of the language model
- observation space
- the distribution of possible input token sequences
- reward function
- a combination of the preference model and a constraint on policy shift
- Given a prompt $x$ from the dataset, the text $y$ is generated by the current iteration of the fine-tuned policy. Concatenated with the original prompt, that text is passed to the preference model, which returns a scalar notion of “preferability”.
- The KL divergence term penalizes the RL policy for moving substantially away from the initial pretrained model with each training batch
- The final reward sent to the RL update rule is $r = r_{\theta} - \lambda r_{KL}$.
- the PPO update rule is the parameter update that maximizes the reward metrics in the current batch of data
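A minimal sketch of how this penalized reward could be computed, assuming per-token log probabilities from the current policy and from the frozen initial model are already in hand; `rlhf_reward` and `kl_coef` are illustrative names, and the KL term is approximated by the summed log-ratio.

```python
import torch

def rlhf_reward(
    preference_score: torch.Tensor,   # r_theta: scalar from the preference/reward model
    policy_logprobs: torch.Tensor,    # per-token log probs under the current RL policy
    ref_logprobs: torch.Tensor,       # per-token log probs under the frozen initial model
    kl_coef: float = 0.1,             # lambda; the actual value is a tuning choice
) -> torch.Tensor:
    """Sketch of the penalized reward r = r_theta - lambda * r_KL."""
    # Approximate the KL penalty per sequence as the summed log-ratio between
    # the current policy and the frozen reference model.
    kl_penalty = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return preference_score - kl_coef * kl_penalty
```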
Open-source tools for RLHF
Tools
- TRL (Hugging Face): fine-tune transformer language models with PPO
- TRLX (CarperAI): an expanded fork of TRL built to handle larger models
- RL4LMs (Allen AI): building blocks for fine-tuning and evaluating LLMs with a variety of RL algorithms
Datasets
- There is a large human-preference dataset created by Anthropic available on the Hugging Face Hub.
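For illustration, it can be loaded with the 🤗 `datasets` library; this assumes the `Anthropic/hh-rlhf` dataset id, whose examples each contain a `chosen` and a `rejected` conversation.

```python
from datasets import load_dataset

# Anthropic's human-preference data on the Hugging Face Hub; each record has a
# "chosen" and a "rejected" conversation between a human and an assistant.
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"][:200])
```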
What’s next for RLHF?
- Generating well-written human text answering specific prompts is very costly
- Human annotators can often disagree
- PPO is a relatively old algorithm, but there are no structural reasons that other algorithms could not offer benefits and permutations on the existing RLHF workflow