LLMs

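A comparative look at the training data and optimization techniques of recent LLMs such as LLaMa, PaLM, and Mistral. See each model's characteristics and the latest technical trends at a glance.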

Inc Lomin

Apr 25, 2024

Preliminary

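Typical LLM training pipeline

Pre-train autoregressively (AR) on a large corpus
Instruction-tune to align the model with human preferences

Supervised fine-tuning, RLHF

Closed LLMs (ChatGPT, BARD, Claude) are usually trained heavily on human preferences → their performance is generally hard to reproduce, and the process takes a lot of cost and human annotation

Alignment used to be done mostly with RLHF + PPO, but DPO (Direct Preference Optimization) is now commonly used instead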
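To make the DPO mention concrete, here is a minimal sketch of the per-pair DPO objective as it is usually written. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any of the papers summarized here; the inputs are assumed to be summed log-probabilities of whole responses.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    logp_*     : log-probability under the policy being trained
    ref_logp_* : log-probability under the frozen reference model
    beta       : strength of the implicit KL constraint (assumed value)
    """
    # How much more the policy favors the chosen response over the rejected
    # one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

No reward model or PPO rollout appears here: the preference pair is turned directly into a classification-style loss, which is the main practical appeal over RLHF + PPO.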

LLaMa-2, 2-Chat (LLaMa-1)

LLaMa-1 link: https://arxiv.org/pdf/2302.13971.pdf

LLaMa-2, 2-Chat link: https://arxiv.org/pdf/2307.09288.pdf

Scales

7B, 13B, 34B (not released), 70B

Pre-training

LLaMa-1 data (sampling proportion, Disk size)

CommonCrawl (67%, 3.3TB)

C4 (15%, 783GB): cleaned version of CommonCrawl

GitHub (4.5%, 328GB)

Wikipedia (4.5%, 83GB)

Gutenberg and Books (4.5%, 85GB)

ArXiv (2.5%, 92GB)

Stack Exchange (2%, 78GB)

Total: trained on 1T tokens (7B, 13B) or 1.4T tokens (33B, 65B)
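A small sketch of what the sampling proportions mean in practice: each training document is drawn from a source with probability equal to that source's share. The dictionary keys and the helper names `sample_sources` and `token_budget` are hypothetical; only the percentages come from the list above.

```python
import random
from collections import Counter

# Sampling proportions (%) from the LLaMa-1 list above.
MIXTURE = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "GitHub": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "StackExchange": 2.0,
}

def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to MIXTURE."""
    rng = random.Random(seed)
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=n)

def token_budget(total_tokens):
    """Expected number of tokens drawn from each source for a training budget."""
    return {name: total_tokens * pct / 100 for name, pct in MIXTURE.items()}

counts = Counter(sample_sources(10_000))
print(counts.most_common(3))
print(token_budget(1.4e12)["Wikipedia"])  # 4.5% of a 1.4T-token run
```

Because the proportions are sampling weights rather than disk-size ratios, small sources such as Wikipedia are seen more than once while CommonCrawl is seen roughly once.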

LLaMa-2 data

LLaMa-1 data

+ up-sampling of the most factual sources to reduce hallucination

+ more robust cleaning

+ new data mixes

+ 40% more tokens (2T tokens)

+ 2x context length (4k)

Total: trained on 2T tokens

Instruction (fine)-tuning

LLaMa-2-Chat

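Aligned over several months with instruction-tuning and RLHF

Phase 1: Supervised fine-tuning (SFT)

prompt + response (AR)
batch 64, epochs 2, length 4096
data

Starting point: https://arxiv.org/pdf/2210.11416.pdf → lacked diversity (style, etc.)

Out of millions of available examples, only 27,540 high-quality ones were collected and used (following the LIMA paper)

The authors observed that SFT dataset quality has a large effect on model performance, and stress that checking the data is essential when dataset creation is outsourced

Phase 2: RLHF

The reward models and the policy were updated successively as each new batch of annotations arrived, giving RLHF-V1, …, RLHF-V5

Annotators label a 4-level preference between responses to a prompt, plus a safety label
Preference and safety are handled by separate reward models, so two reward models are trained in total
Rejection sampling, trained like SFT: sample several responses per prompt from the LM policy, keep the one with the highest reward, and tune the model on it as in SFT
PPO

Up to RLHF-V4 only rejection sampling fine-tuning was used; from V5 the two were combined in sequence (PPO applied on top of the rejection sampling checkpoint)
Rejection sampling is run only with the largest LLaMa-2-Chat (70B); all smaller models (7B, 13B, 34B) are fine-tuned on the 70B rejection-sampled data, i.e. distilled from it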
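The rejection sampling step described above can be sketched as a best-of-K selection followed by ordinary SFT on the winners. `generate` and `reward` are hypothetical callables standing in for the LM policy and the reward model, and `k=4` is an arbitrary choice for illustration.

```python
def rejection_sample(prompt, generate, reward, k=4):
    """Sample k candidate responses and keep the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))

def build_sft_pairs(prompts, generate, reward, k=4):
    """Turn the best-of-k winners into (prompt, response) pairs for SFT-style tuning."""
    return [(p, rejection_sample(p, generate, reward, k)) for p in prompts]

# Toy stand-ins: the "policy" replays canned answers, the "reward" prefers longer text.
canned = iter(["ok", "a longer answer", "short", "mid one"])
pairs = build_sft_pairs(
    ["How do I sort a list?"],
    generate=lambda prompt: next(canned),
    reward=lambda prompt, response: len(response),
)
print(pairs)
```

In the distillation setup above, the same pairs produced by the 70B policy would then be used as fine-tuning data for the smaller models.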

Ghost Attention (GAtt)

When building the prompt, an instruction (e.g. "answer only with emojis") is prepended to obtain the responses; the instruction is then removed from every turn except the first, and the turns are concatenated for training.
Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling. Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and our sample. To fix this issue, which could hurt the training, we simply set the loss to 0 for all the tokens from the previous turns, including assistant messages. → This part is not clear to me..
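One possible reading of the quoted passage: the earlier turns stay in the context, but only the tokens of the final sampled response contribute to the loss. The sketch below illustrates that reading with a plain loss mask; the helper names and the toy numbers are assumptions, not the paper's code.

```python
def last_turn_mask(turn_lengths):
    """Return 1 for tokens of the final turn and 0 for tokens of every earlier turn."""
    mask = []
    for i, n_tokens in enumerate(turn_lengths):
        is_last = i == len(turn_lengths) - 1
        mask.extend([1 if is_last else 0] * n_tokens)
    return mask

def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the positions where loss_mask is 1."""
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(kept) / len(kept)

# Three turns of 2, 3 and 2 tokens: only the last 2 tokens are trained on.
mask = last_turn_mask([2, 3, 2])
loss = masked_nll([-1.0, -1.0, -1.0, -1.0, -1.0, -2.0, -4.0], mask)
print(mask, loss)
```

Under this reading, the intermediate assistant messages (which were generated with the instruction present) never produce a gradient, so the mismatch mentioned in the quote cannot hurt training.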

Hardware & Training time

2048 A100 GPUs used (LLaMa-1)

GPU hours (LLaMa-2)

7B: 184,320h
13B: 368,640h
34B: 1,038,336h
70B: 1,720,320h

Architecture

LLaMa-1 & LLaMa-2

SwiGLU activation (PaLM)

Pre-normalization (GPT-3)

Rotary Positional Embedding (RoPE, GPT-NeoX)

Grouped-Query Attention (GQA) → added in LLaMa-2 (34B and 70B only)

GQA example
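As a rough illustration of GQA, the sketch below lets several query heads share one key/value head, which is what shrinks the KV cache. It is a pure-Python toy with causal masking; the function names and the list-of-lists tensor layout are assumptions chosen for readability, not how LLaMa-2 implements it.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Causal single-head attention over lists of T vectors."""
    d = len(queries[0])
    out = []
    for t, q_t in enumerate(queries):
        # Position t may only look at positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q_t, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[s] * values[s][j] for s in range(t + 1))
            for j in range(len(values[0]))
        ])
    return out

def grouped_query_attention(q_heads, k_heads, v_heads):
    """q_heads has n_q heads; k_heads and v_heads have n_kv heads, n_q % n_kv == 0."""
    group = len(q_heads) // len(k_heads)
    # Query head h reads K/V head h // group, so `group` query heads share one K/V head.
    return [
        attend(q, k_heads[h // group], v_heads[h // group])
        for h, q in enumerate(q_heads)
    ]
```

With `n_kv == n_q` this reduces to ordinary multi-head attention, and with `n_kv == 1` to multi-query attention; GQA sits between the two.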

PaLM-1, 2 (Pathways LM)

PaLM-1 link: https://arxiv.org/pdf/2204.02311.pdf

PaLM-2 link: https://arxiv.org/pdf/2305.10403.pdf

Scales

PaLM-1: 8B, 62B, 540B

PaLM-2: not disclosed

Architecture

PaLM-1

SwiGLU activation
Rotary Positional Embedding (RoPE)

PaLM-2: not disclosed

Pre-training

PaLM-1 data (M: multilingual, E: English)

Social media conversation (M, 50%)
Filtered webpages (M, 27%)
Books (E, 13%)
GitHub (code, 5%)
Wikipedia (M, 4%)
News (E, 1%)

PaLM-2 data

Per-source proportions are not disclosed

Uses more data than PaLM-1, with a lower proportion of English

Phi-1 (code LLM)

Phi-1 link (Textbooks are all you need): https://arxiv.org/pdf/2306.11644.pdf

Scales

base: 1.3B, small: 350M

Architecture

Decoder-only Transformer + FlashAttention implementation of multi-head attention

Multi-head attention and MLP layers in a parallel configuration, following:

CodeGen
PaLM
GPT-NeoX

FP16 training with AdamW

Pre-training

Data

Textbook-quality web data (6B tokens)

GPT-3.5 synthetic data (1B tokens)

Training

350M: 26B tokens seen over the 7B-token dataset (approx. 4 passes)

1.3B: 51B (8 epochs == 36,000 steps) to 76B tokens seen over the 7B-token dataset (approx. 8 to 11 passes)
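The pass counts above are simply training tokens divided by dataset size. A quick check, where the constant and function names are illustrative only:

```python
DATASET_TOKENS = 7e9  # 6B filtered web tokens + 1B synthetic tokens

def passes(train_tokens):
    """Number of times the 7B-token dataset is traversed for a given token budget."""
    return train_tokens / DATASET_TOKENS

for name, tokens in [("350M", 26e9), ("1.3B (low)", 51e9), ("1.3B (high)", 76e9)]:
    print(f"{name}: {passes(tokens):.1f} passes")
```

The raw ratio for the 51B-token run comes out slightly under the 8 epochs quoted above, so the quoted pass counts should be read as rounded figures.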

Instruction (fine)-tuning

Data: CodeExercises dataset

200M token fine-tuning (6,000 steps)

Hardware & Training time

Phi-1: 8 x A100 with DeepSpeed (4 days)

51B tokens: 770 GPU hours
76B tokens: 1090 GPU hours
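As a sanity check on these figures, 8 GPUs running for 4 days is about the same as the 770 GPU hours listed for the 51B-token run (assuming the 4 days refers to that run). The helper names below are illustrative.

```python
def gpu_hours(num_gpus, days):
    """Wall-clock days on num_gpus devices, expressed in GPU hours."""
    return num_gpus * days * 24

def tokens_per_gpu_hour(train_tokens, hours):
    """Average training throughput implied by a token budget and a GPU-hour count."""
    return train_tokens / hours

print(gpu_hours(8, 4))  # 768, close to the reported 770
for tokens, hours in [(51e9, 770), (76e9, 1090)]:
    print(f"{tokens_per_gpu_hour(tokens, hours) / 1e6:.1f}M tokens per GPU hour")
```

Both runs imply a similar throughput, which suggests the two GPU-hour numbers describe the same setup trained for different lengths.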

Results

Phi-1.5, Phi-1.5-web (common sense, reasoning LLM)

Phi-1.5 link (Textbooks are all you need 1.5): https://arxiv.org/pdf/2309.05463.pdf

Scales

1.3B

Architecture

Exactly the same as Phi-1

Pre-training

Data

Phi-1.5: Phi-1's 7B tokens + textbook-like synthetic data (20B tokens)

Phi-1.5-web: 88B tokens (Falcon refined web) + 7B tokens (StackOverflow, etc.) + Phi-1.5 data

Training

Phi-1.5: 150B tokens (5 epochs), Phi-1.5-web: 300B tokens

Sampling ratio 80:20 (new synthetic data : Phi-1 data)

Instruction (fine)-tuning

No instruction-tuning, No RLHF

Hardware & Training time

Phi-1.5: 1,500 GPU hours

Phi-1.5-web: 3,000 GPU hours

Results

Mistral

Mistral link: https://arxiv.org/pdf/2310.06825.pdf

Scales

7B

Architecture

Grouped-Query Attention (GQA) ➝ faster inference, lower memory use

Sliding Window Attention (SWA) ➝ handles long sequences

Rolling Buffer Cache ➝ only a window's worth of cache has to be stored, so the cache size stays constant however long the sequence gets.

Rolling Buffer Cache

Pre-fill and Chunking ➝ the prompt is known in advance, so the fixed-size cache can be filled with it up front
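A minimal sketch of the sliding-window and rolling-buffer ideas above: token i attends to at most the last `window` positions, and the cache entry for position i is written to slot `i % window`, overwriting whatever was there. The class and function names are hypothetical, and plain strings stand in for the real key/value tensors.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (causal, last `window` tokens)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

class RollingBufferCache:
    """Fixed-size KV cache: the entry for position i lives in slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.next_pos = 0

    def append(self, kv):
        # Older entries outside the window are overwritten in place.
        self.slots[self.next_pos % self.window] = (self.next_pos, kv)
        self.next_pos += 1

    def visible(self):
        """Cached (position, kv) pairs in position order."""
        return sorted((s for s in self.slots if s is not None), key=lambda s: s[0])

    def prefill(self, prompt_kvs):
        """Fill the cache from a prompt that is known in advance."""
        for kv in prompt_kvs:
            self.append(kv)

cache = RollingBufferCache(window=3)
cache.prefill(["kv0", "kv1", "kv2", "kv3", "kv4"])
print(cache.visible())
```

The cache never holds more than `window` entries, which is why memory stays flat as the sequence grows.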

Pre-training

정보 없음

Instruction (fine)-tuning

Instruction-tuned on datasets publicly available on Hugging Face to build a chat model

Results

LIMA (Less Is More for Alignment)

LIMA link: https://arxiv.org/pdf/2305.11206.pdf

Scales

65B

Architecture

Same as LLaMa-1

Pre-training

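Aligning an LLM to preferences mostly changes its style, while most of its knowledge is learned during pre-training, so the experiments start from a pre-trained LLM.

Alignment is possible with a small amount of carefully curated data, even without reinforcement learning such as RLHF

Scaling up input diversity and output quality has a positive effect, whereas scaling up quantity alone may not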

Instruction (fine)-tuning

Model

LLaMa 65B fine-tuning

Data

Stack Exchange (400 examples)

WikiHow (200 examples)

Pushshift Reddit (150 examples)

Examples written by the authors (250 examples) ➝ to cover a variety of styles

Total tokens: 750,000

Training

15 epochs

Results

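Preference

Preferred over Alpaca and over DaVinci trained with RLHF

Data quality, quantity, and diversity

In the left figure, data that is high quality and has diverse responses scores as much as 0.5 points higher

In the right figure: adding data is a well-known way to improve performance in many ML settings, but models trained on exponentially larger samples of the training set showed no improvement in response quality

Conclusion

Exponentially increasing the training set did not improve LIMA's response quality

High-quality prompts and diversity matter more than dataset size

Fine-tuning on just 30 multi-turn dialogues was enough to improve multi-turn performance dramatically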

Other models

Tulu 65B

LLaMa 65B fine-tuned (data: FLAN v2, CoT, Dolly, Open Assistant 1, GPT-Alpaca, Code-Alpaca, and ShareGPT)

Uses DPO

Mixtral

8 x 7B (mixture of experts)

data: extracted from the open web, similar to Mistral (no details given)

Uses DPO

Multiple Model Merging Tools (Deep Model Fusion)

Mergekit GitHub (Soup, Slerp, TIES merging, etc.): https://github.com/cg123/mergekit

Model soup paper: https://arxiv.org/pdf/2203.05482.pdf

TIES-Merging paper: https://arxiv.org/pdf/2306.01708.pdf

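Effect of merging

Models built with mergekit, with no training at all, took 1st through 4th place on the leaderboard

The LMCocktail model on the leaderboard is also a merge of the Solar and Meow models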

Model soup

Motivation

The fine-tuning process has two main steps

Train several models with different hyperparameters

Then pick the model with the highest accuracy on the validation set and discard the rest

Problems with this approach

No guarantee of performance on out-of-distribution datasets
Ensembling several models can improve performance, but increases inference cost

Model soup improves the second of the two steps above

Models fine-tuned from the same pre-trained model tend to lie in a similar loss basin (landscape)
Ensembling and weight averaging have often been reported to work well

Model soup: no additional training, no additional inference cost

Method

\text{Neural network: } f(x, \theta)

\theta_i = \mathrm{FineTune}(\theta_0, h_i) \quad \text{(parameters of the } i\text{-th fine-tuned model)}

h_i = \text{hyperparameter configuration of the } i\text{-th model}, \quad \theta_0 = \text{pre-trained model parameters}

Best on val. set: the usual logic of selecting a single model

Ensemble: model ensembling

Uniform soup: average the parameters of all k models → in some cases this actually lowers performance

Greedy soup

Sort the models by validation accuracy in descending order

Add the models one at a time to the running parameter average, discarding any model that lowers performance

Learned soup: learn the optimal per-model mixing coefficients by training → uses a lot of memory

\arg\min_{\alpha \in \mathbb{R}^k,\, \beta \in \mathbb{R}} \sum_{j=1}^{n} \ell\Big(\beta \cdot f\Big(x_j, \sum_{i=1}^{k} \alpha_i \theta_i\Big), y_j\Big)
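The uniform and greedy soups above can be sketched in a few lines. Parameters are represented as plain dicts of float lists, and `val_acc` is a hypothetical callable that scores a parameter dict on held-out data; both are assumptions made to keep the example self-contained.

```python
def average(thetas):
    """Element-wise mean of a list of parameter dicts (name -> list of floats)."""
    n = len(thetas)
    return {
        name: [sum(vals) / n for vals in zip(*(t[name] for t in thetas))]
        for name in thetas[0]
    }

def uniform_soup(thetas):
    """Average every fine-tuned model, good or bad."""
    return average(thetas)

def greedy_soup(thetas, val_acc):
    """Add models in descending validation accuracy, keeping each only if it does not hurt."""
    ranked = sorted(thetas, key=val_acc, reverse=True)
    ingredients = [ranked[0]]
    best = val_acc(ranked[0])
    for theta in ranked[1:]:
        candidate_acc = val_acc(average(ingredients + [theta]))
        if candidate_acc >= best:
            ingredients.append(theta)
            best = candidate_acc
    return average(ingredients)

# Toy setup: one scalar weight, "accuracy" peaks at w = 1.
models = [{"w": [0.8]}, {"w": [1.3]}, {"w": [5.0]}]
toy_acc = lambda theta: -(theta["w"][0] - 1.0) ** 2
print(uniform_soup(models), greedy_soup(models, toy_acc))
```

In the toy run the outlier model drags the uniform soup far from the optimum, while the greedy soup rejects it, which matches the note above that a uniform soup can lower performance.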

Intuition

Loss landscape (loss/error contour)

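→ Suggests that interpolating the weights of two models that are each non-optimal, as in the figure, can improve accuracy

→ The more uncorrelated the two models are (the closer the angle is to 90 degrees), the higher the accuracy that interpolation can reach (the larger the possible improvement)

→ In the contour above, the larger the angle (the closer to 90 degrees), the more weight averaging actually improves the result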

Results

a. CLIP ViT-B/32 → 5 of 72 models selected by the greedy procedure

b. ALIGN → 5 of 12 models selected by the greedy procedure

c. ViT-G → 14 of 58 models selected by the greedy procedure

Greedy soup also performs well on distribution-shifted data

TIES-merging (TrIm, Elect Sign & Merge)

Motivation

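A pre-trained model is usually fine-tuned to apply it to a specific task, but this has the following problems

A separate model has to be stored and deployed for each application

Independently trained models cannot use information from related tasks → hard to improve in-domain performance or out-of-domain generalization

Multi-task learning can address these problems, but it takes effort for every task, training is costly, and finding a data mixture that gives good results on all tasks is hard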

Intuition

Intuition 1

→ With plain interpolation merging, parameters that matter for one model can be interfered with by redundant parameters from the other models

Intuition 2

→ Performance when, for models fine-tuned on 11 tasks, only the top-k% of fine-tuned parameters by task-vector magnitude are kept and the rest are reset to the pre-trained weights

→ This is the average over the 11 tasks; keeping only 20% causes almost no drop → evidence that many parameters are redundant

Intuition 3

→ Sign-conflict ratio after top-20% trimming when merging models trained on different tasks, going from 2 models up to 11

Sign conflicts appear even between models trained on the same task → even then, the conflict ratio tends to rise as more models are merged

TIES methods

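Note: the vectors in the figure are the differences between fine-tuned and pre-trained parameter values, i.e. task vectors

Compute the task vector \tau_t for task t: the difference between the parameters fine-tuned on t and the pre-trained parameters

\tau_t = \theta_t - \theta_{init}, \quad \tau_t \in \mathbb{R}^d

Trim: choose the parameter candidates \hat{\tau}_t by magnitude (keep only the top 20%) → \hat{\tau}_t can be decomposed into a sign vector \hat{\gamma}_t and a magnitude vector \hat{\mu}_t (element-wise product)

\hat{\tau}_t = \hat{\gamma}_t \odot \hat{\mu}_t

Elect Sign: for each parameter, sum the values with + sign and the values with - sign, and assign the sign of the side with the larger total magnitude → \gamma_m = \mathrm{sgn}\big(\sum_{t=1}^{n} \hat{\tau}_t\big)
Disjoint Merge: average only the remaining values, i.e. those whose sign agrees with the elected sign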
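The three TIES steps can be sketched on flat parameter lists as follows. This is a toy version under stated assumptions: parameters are plain Python lists, ties at the trimming threshold are all kept, and the final scaling factor `lam` (set to 1.0 here) is an illustrative parameter that the notes above do not discuss.

```python
def trim(tau, keep_frac=0.2):
    """Keep the largest-magnitude entries (top keep_frac), zero out the rest."""
    k = max(1, int(round(len(tau) * keep_frac)))
    threshold = sorted((abs(x) for x in tau), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in tau]

def ties_merge(theta_init, finetuned, keep_frac=0.2, lam=1.0):
    """Merge several fine-tuned parameter vectors into one, TIES-style."""
    # Task vectors: tau_t = theta_t - theta_init
    taus = [[ft - p0 for ft, p0 in zip(theta, theta_init)] for theta in finetuned]
    trimmed = [trim(tau, keep_frac) for tau in taus]
    merged = []
    for p0, column in zip(theta_init, zip(*trimmed)):
        # Elect sign: the sign of the sum is the sign with the larger total magnitude.
        total = sum(column)
        sign = (total > 0) - (total < 0)
        # Disjoint merge: average only the entries that agree with the elected sign.
        agreeing = [x for x in column if x * sign > 0]
        tau_m = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(p0 + lam * tau_m)
    return merged

theta_init = [0.0, 0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, -2.0, 0.0, 0.05]
model_b = [-3.0, 0.2, 1.0, 0.0, 0.1]
print(ties_merge(theta_init, [model_a, model_b], keep_frac=0.4))
```

In the toy example the two models disagree in sign on the first and third parameters; plain averaging would partly cancel them, whereas here the side with the larger magnitude wins outright.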

Results

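Outperforms the other merging methods.

However, it still falls short of the individually fine-tuned models and of the multi-task model (trained on the concatenated datasets)

TIES merging generalizes well and holds up as tasks are added → even so, performance tends to drop as the number of tasks grows, apparently because sign conflicts increase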