Some Notes on the InstructGPT Paper and RLHF

Training language models to follow instructions with human feedback

Introduction

Methods and experimental details

High-level methodology

Models

Evaluation

Discussion

Implications for alignment research

Open questions

Illustrating Reinforcement Learning from Human Feedback (RLHF)

RLHF: Let’s take it step by step

Pretraining language models

Reward model training

Fine-tuning with RL

Open-source tools for RLHF

What’s next for RLHF?