Some Notes on the LLaMA Paper
04 January 2024
Tags:
Notes
Paper Reading
LLMs
- Approach
- Pre-training Data (sampling mixture from Table 1 of the paper; sketch below)
- English CommonCrawl [67%]
- C4 [15%]
- Github [4.5%]
- Wikipedia [4.5%]
- Gutenberg and Books3 [4.5%]
- ArXiv [2.5%]
- Stack Exchange [2%]
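The mixture is essentially a sampling configuration: each source contributes a fixed proportion of the training tokens. A minimal sketch of such a weighted sampler (the proportions follow the paper's Table 1; the sampler itself is illustrative, not the paper's data pipeline):

```python
import random

# Sampling proportions of the LLaMA pre-training mixture (Table 1 of the paper).
DATA_MIXTURE = {
    "English CommonCrawl": 0.670,
    "C4": 0.150,
    "Github": 0.045,
    "Wikipedia": 0.045,
    "Gutenberg and Books3": 0.045,
    "ArXiv": 0.025,
    "Stack Exchange": 0.020,
}

def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n data sources according to the mixture weights."""
    rng = random.Random(seed)
    names, weights = zip(*DATA_MIXTURE.items())
    return rng.choices(names, weights=weights, k=n)

if __name__ == "__main__":
    draws = sample_sources(10_000)
    for name in DATA_MIXTURE:
        # Empirical frequency approaches the target proportion as n grows.
        print(f"{name:22s} {draws.count(name) / len(draws):.3f}")
```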
- Tokenizer
- byte-pair encoding (BPE) algorithm, using the SentencePiece implementation (toy sketch of the merge step below)
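BPE builds a vocabulary by repeatedly merging the most frequent adjacent symbol pair. A toy sketch of that merge-learning loop (purely illustrative; the paper uses SentencePiece's BPE, with numbers split into individual digits and a byte-level fallback for unknown UTF-8 characters):

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(word) for word in corpus]  # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the new merge everywhere in the corpus.
        for i, symbols in enumerate(words):
            j, out = 0, []
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            words[i] = out
    return merges

print(learn_bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=5))
```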
- Architecture
- Pre-normalization [GPT3]
- normalize the input of each transformer sub-layer instead of the output
- improves training stability
- uses the RMSNorm normalizing function (sketch below)
- SwiGLU activation function [PaLM]: replaces ReLU; the hidden dimension is 2/3 · 4d rather than 4d (sketch below)
- Rotary Embeddings (RoPE) [GPTNeo]: absolute positional embeddings are removed and rotary embeddings are applied at each layer (sketch below)
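A minimal PyTorch sketch of RMSNorm and the pre-normalization residual pattern; class names such as `PreNormBlock` are mine, not the paper's:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreNormBlock(nn.Module):
    """Pre-normalization: normalize the *input* of each sub-layer, then add the residual."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))
```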
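A sketch of the SwiGLU feed-forward block, assuming the 2/3 · 4d hidden size mentioned in the paper; the gating follows the common w2(silu(w1·x) * w3·x) formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating instead of a ReLU MLP."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)  # 2/3 * 4d keeps the parameter count comparable
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```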
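A sketch of rotary embeddings using the "rotate-half" convention (implementations differ in how they pair channels); it would be applied to the query and key vectors at every layer so attention scores depend on relative positions:

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq_len, head_dim).

    Each channel pair is rotated by an angle proportional to the token position.
    Assumes head_dim is even.
    """
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```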
- Optimizer
- AdamW (β1 = 0.9, β2 = 0.95), cosine learning-rate schedule with the final learning rate at 10% of the maximum
- weight decay of 0.1, gradient clipping at 1.0, 2,000 warmup steps
- Efficient implementation
- an efficient implementation of causal multi-head attention
- reduces memory usage and runtime by not storing the attention weights and not computing the key/query scores that are masked
- available in the xformers library (sketch after this list)
- reduced the amount of activations recomputed during the backward pass with checkpointing
- by manually implementing the backward function for the transformer layers instead of relying on PyTorch autograd (illustrated with torch.utils.checkpoint after this list)
- using model and sequence parallelism
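The paper relies on xformers' memory-efficient attention; as a stand-in, this sketch uses PyTorch's fused `scaled_dot_product_attention` (available since PyTorch 2.0), which likewise avoids materializing the full attention matrix, with a naive version shown for contrast:

```python
import torch
import torch.nn.functional as F

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal attention without storing the full (seq_len x seq_len) score matrix.

    q, k, v: (batch, num_heads, seq_len, head_dim)
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def naive_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Reference version: builds and masks the full score matrix explicitly."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    mask = torch.triu(
        torch.ones(q.shape[-2], k.shape[-2], dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```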
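A sketch of the checkpointing idea using `torch.utils.checkpoint`; the paper goes further by hand-writing the backward pass for the transformer layers, so the expensive linear-layer activations can be kept rather than recomputed:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Wrap a transformer layer so its activations are recomputed during backward."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended checkpointing mode in recent PyTorch.
        return checkpoint(self.layer, x, use_reentrant=False)
```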
- Main results
- Common Sense Reasoning
- BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA
- Closed-book Question Answering
- Natural Questions, TriviaQA
- Reading Comprehension
- RACE reading comprehension benchmark
- Mathematical reasoning
- MATH, GSM8k
- Code generation
- HumanEval, MBPP
- Massive Multitask Language Understanding
- Instruction Finetuning