Some Notes on the LLaMA Paper
04 January 2024
Tags:
Notes
Paper Reading
LLMs
- Approach
- Pre-training Data (sampling mixture from Table 1 of the paper; sketch below)
- English CommonCrawl [67%]
- C4 [15%]
- Github [4.5%]
- Wikipedia [4.5%]
- Gutenberg and Books3 [4.5%]
- ArXiv [2.5%]
- Stack Exchange [2%]
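The mixture is essentially a sampling configuration: each source contributes a fixed proportion of the training tokens. A minimal sketch of such a weighted sampler (the proportions follow the paper's Table 1; the sampler itself is illustrative, not the paper's data pipeline):

```python
import random

# Sampling proportions of the LLaMA pre-training mixture (Table 1 of the paper).
DATA_MIXTURE = {
    "English CommonCrawl": 0.670,
    "C4": 0.150,
    "Github": 0.045,
    "Wikipedia": 0.045,
    "Gutenberg and Books3": 0.045,
    "ArXiv": 0.025,
    "Stack Exchange": 0.020,
}

def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n data sources according to the mixture weights."""
    rng = random.Random(seed)
    names, weights = zip(*DATA_MIXTURE.items())
    return rng.choices(names, weights=weights, k=n)

if __name__ == "__main__":
    draws = sample_sources(10_000)
    for name in DATA_MIXTURE:
        # Empirical frequency approaches the target proportion as n grows.
        print(f"{name:22s} {draws.count(name) / len(draws):.3f}")
```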
- Tokenizer
- byte-pair encoding (BPE) algorithm, using the SentencePiece implementation (toy sketch of the merge step below)
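BPE builds a vocabulary by repeatedly merging the most frequent adjacent symbol pair. A toy sketch of that merge-learning loop (purely illustrative; the paper uses SentencePiece's BPE, with numbers split into individual digits and a byte-level fallback for unknown UTF-8 characters):

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(word) for word in corpus]  # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the new merge everywhere in the corpus.
        for i, symbols in enumerate(words):
            j, out = 0, []
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            words[i] = out
    return merges

print(learn_bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=5))
```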
- Architecture
- Pre-normalization [GPT3]
- normalize the input of each transformer sub-layer instead of the output
- improves training stability
- uses the RMSNorm normalizing function (sketch below)
- SwiGLU activation function [PaLM]: replaces ReLU; the hidden dimension is 2/3 · 4d rather than 4d (sketch below)
- Rotary Embeddings (RoPE) [GPTNeo]: absolute positional embeddings are removed and rotary embeddings are applied at each layer (sketch below)
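A minimal PyTorch sketch of RMSNorm and the pre-normalization residual pattern; class names such as `PreNormBlock` are mine, not the paper's:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreNormBlock(nn.Module):
    """Pre-normalization: normalize the *input* of each sub-layer, then add the residual."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))
```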
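A sketch of the SwiGLU feed-forward block, assuming the 2/3 · 4d hidden size mentioned in the paper; the gating follows the common w2(silu(w1·x) * w3·x) formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating instead of a ReLU MLP."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)  # 2/3 * 4d keeps the parameter count comparable
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```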
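A sketch of rotary embeddings using the "rotate-half" convention (implementations differ in how they pair channels); it would be applied to the query and key vectors at every layer so attention scores depend on relative positions:

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq_len, head_dim).

    Each channel pair is rotated by an angle proportional to the token position.
    Assumes head_dim is even.
    """
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```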
- Optimizer
- AdamW (β1 = 0.9, β2 = 0.95), cosine learning-rate schedule with the final learning rate at 10% of the maximum
- weight decay of 0.1, gradient clipping at 1.0, 2,000 warmup steps
- Efficient implementation
- an efficient implementation of causal multi-head attention
- reduces memory usage and runtime by not storing the attention weights and not computing the key/query scores that are masked
- available in the xformers library (sketch after this list)
- reduced the amount of activations recomputed during the backward pass with checkpointing
- by manually implementing the backward function for the transformer layers instead of relying on PyTorch autograd (illustrated with torch.utils.checkpoint after this list)
- using model and sequence parallelism
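The paper relies on xformers' memory-efficient attention; as a stand-in, this sketch uses PyTorch's fused `scaled_dot_product_attention` (available since PyTorch 2.0), which likewise avoids materializing the full attention matrix, with a naive version shown for contrast:

```python
import torch
import torch.nn.functional as F

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal attention without storing the full (seq_len x seq_len) score matrix.

    q, k, v: (batch, num_heads, seq_len, head_dim)
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def naive_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Reference version: builds and masks the full score matrix explicitly."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    mask = torch.triu(
        torch.ones(q.shape[-2], k.shape[-2], dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```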
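A sketch of the checkpointing idea using `torch.utils.checkpoint`; the paper goes further by hand-writing the backward pass for the transformer layers, so the expensive linear-layer activations can be kept rather than recomputed:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Wrap a transformer layer so its activations are recomputed during backward."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended checkpointing mode in recent PyTorch.
        return checkpoint(self.layer, x, use_reentrant=False)
```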
- Main results
- Common Sense Reasoning
- BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA
- Closed-book Question Answering
- Natural Questions, TriviaQA
- Reading Comprehension
- RACE reading comprehension benchmark
- Mathematical reasoning
- MATH, GSM8k
- Code generation
- HumanEval, MBPP
- Massive Multitask Language Understanding
- Instruction Finetuning