Synthesizer: Rethinking Self-Attention for Transformer Models

Dec 08, 2021

Contents

Introduction Proposed Method Experiment Conclusions

The Transformer's strong performance is widely attributed to the power of self-attention. This paper questions the attention weights that self-attention produces, and shows that models still perform well without attention weights computed from the dot product of every (query, key) pair.

Introduction

Transformer models, including BERT, achieve excellent performance on natural language processing. Why do Transformers perform so well? Most people credit the Transformer's self-attention. In self-attention, the attention weights are obtained by taking the dot product of every (query, key) pair. The paper questions whether this operation is really the right way to obtain attention weights.

The paper proposes ways to obtain attention weights without taking the dot product of every (query, key) pair, including treating the attention weights themselves as learnable parameters, and evaluates these methods.

Proposed Method

Notations

h : index over attention heads (multi-head attention)

l : index over layers

d : hidden dimension

N : sequence length

X_{h,l}\in R^{N\times d} : input of the transformer block

Y_{h,l}\in R^{N\times d} : output of the transformer block

Proposed Method

Standard attention weights are computed by taking the dot product of every query-key pair.

a = softmax(Q_{h,l}K_{h,l}^{T}) \ \ where \ \ Q_{h,l}=W_qX_{h,l}, \ K_{h,l}=W_kX_{h,l}
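As a minimal sketch of this baseline, the snippet below computes the N x N weight matrix for one head in plain Python. The helper names (`matmul`, `softmax_rows`, `random_matrix`) and the toy sizes are illustrative assumptions, the projections are applied as `x @ w_q` on row vectors, and, like the formula as written here, it leaves out any scaling factor.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dot_product_attention_weights(x, w_q, w_k):
    """Return the N x N weights softmax(Q K^T) for a single head."""
    q = matmul(x, w_q)                      # N x d
    k = matmul(x, w_k)                      # N x d
    k_t = [list(col) for col in zip(*k)]    # d x N
    return softmax_rows(matmul(q, k_t))     # one score per (query, key) pair


rng = random.Random(0)
N, d = 4, 3  # toy sizes chosen only for illustration
x = random_matrix(N, d, rng)
weights = dot_product_attention_weights(
    x, random_matrix(d, d, rng), random_matrix(d, d, rng)
)
```

Each row of `weights` is a probability distribution over the N key positions, and every entry depends on two tokens at once. That pairwise dependence is what the synthesizer variants remove.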

1) Dense Synthesizer

Rather than computing every query-key pair, the Dense Synthesizer derives the attention weights from the query (or key) side alone, that is, from each token's own representation.

a=W_2(\sigma(W_1X_{h,l}))

where \ \ W_1\in R^{d\times d}, \ \ W_2\in R^{d\times N}
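The formula can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: \sigma is taken to be ReLU, a row-wise softmax is applied to the result as in the dot-product formula (the formula above does not write it out), biases are omitted, and all helper names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dense_synthesizer_weights(x, w1, w2):
    """Map each token on its own to a length-N row of attention weights."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]  # ReLU (assumed), N x d
    return softmax_rows(matmul(hidden, w2))                         # N x N


rng = random.Random(0)
N, d = 4, 3  # toy sizes chosen only for illustration
x = random_matrix(N, d, rng)
w1 = random_matrix(d, d, rng)
w2 = random_matrix(d, N, rng)
weights = dense_synthesizer_weights(x, w1, w2)

# Perturb token 1 only: row 0 is computed from token 0 alone, so it must not change.
x_changed = [list(row) for row in x]
x_changed[1] = [v + 1.0 for v in x_changed[1]]
weights_changed = dense_synthesizer_weights(x_changed, w1, w2)
```

Row 0 of `weights` and `weights_changed` is identical, which shows there is no token-to-token interaction when the weights are computed.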

2) Random Synthesizer

The Random Synthesizer treats the attention weights themselves as parameters.

a = R_{h,l} \ \ where \ \ R_{h,l} \in R^{N \times N}

The basic idea: learn a global feature (attention weight) that covers all of the training data.
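A small sketch makes the "global" aspect concrete: the same matrix R produces the same weights no matter what the input is. The snippet assumes a row-wise softmax over R and uses a hypothetical `values` matrix to stand in for the value projection of the input, which this post does not write out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix; stands in for trained weights or inputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def random_synthesizer_output(r, values):
    """Apply input-independent weights softmax(R) to a value matrix."""
    weights = softmax_rows(r)                 # N x N, never looks at the input
    return weights, matmul(weights, values)   # N x d


rng = random.Random(0)
N, d = 4, 3  # toy sizes chosen only for illustration
r = random_matrix(N, N, rng)         # the parameter R, shared by all inputs
values_a = random_matrix(N, d, rng)  # values for one input sequence
values_b = random_matrix(N, d, rng)  # values for a different input sequence
weights_a, out_a = random_synthesizer_output(r, values_a)
weights_b, out_b = random_synthesizer_output(r, values_b)
```

`weights_a` and `weights_b` are equal even though the two inputs differ. Only the mixed outputs `out_a` and `out_b` depend on the sequence.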

With this formulation, the dense synthesizer adds N\times d parameters (the size of W_2) and the random synthesizer adds N\times N parameters. Because this parameter cost grows as more Transformer layers are stacked, the dense and random synthesizers are factorized to reduce the parameter count.
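To get a feel for the sizes involved, here is a small arithmetic sketch. The values of N, d, and k are toy numbers picked for illustration, not settings from the paper, and the counts are per head and per layer. The factorized random count uses the R_1, R_2 shapes defined in 4) below.

```python
def extra_parameters(n, d, k):
    """Sizes of the synthesizer matrices for one head in one layer."""
    return {
        "dense (W_2: d x N)": d * n,
        "random (R: N x N)": n * n,
        "factorized random (R_1, R_2: N x k each)": 2 * n * k,
    }


# Toy setting: sequence length 512, hidden size 64, factor rank 8.
counts = extra_parameters(n=512, d=64, k=8)
for name, value in counts.items():
    print(f"{name}: {value}")
```

In this setting the full random matrix is 32 times larger than its rank-8 factorization, and the gap widens as N grows, because one count is quadratic in N and the other is linear.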

3) Factorized Dense Synthesizer

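A=F_A(X_{h,l}), \ B=F_B(X_{h,l}) \ \ where \ \ A \in R^{N\times a}, \ B \in R^{N\times b} \ \ and \ \ a\times b=N

a=H_A(A)*H_B(B)

where H_A and H_B are tiling functions: H_A(*) maps R^{N\times a} to R^{N\times (a \times b)}, and H_B(*) maps R^{N\times b} to R^{N\times (b \times a)}.

(On the left-hand side, a is the attention weight as in the earlier formulas; in the dimensions, a and b are the two factor sizes.)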
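The tiling step is easier to see in code. The sketch below is one reading of the definition and rests on several assumptions: F_A and F_B are plain linear projections (hypothetical matrices `w_a` and `w_b`), tiling means repeating each row until it has length a x b = N, `*` is the element-wise product, and a row-wise softmax is applied at the end.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def factorized_dense_weights(x, w_a, w_b):
    """Build N x N weights from two narrow projections of each token."""
    a_mat = matmul(x, w_a)  # N x a
    b_mat = matmul(x, w_b)  # N x b
    a_dim, b_dim = len(a_mat[0]), len(b_mat[0])
    if a_dim * b_dim != len(x):
        raise ValueError("a * b must equal the sequence length N")
    logits = [
        # Tile each row to length a * b, then multiply element-wise.
        [u * v for u, v in zip(row_a * b_dim, row_b * a_dim)]
        for row_a, row_b in zip(a_mat, b_mat)
    ]
    return softmax_rows(logits)  # N x N


rng = random.Random(0)
N, d, a, b = 6, 4, 2, 3  # toy sizes with a * b == N
x = random_matrix(N, d, rng)
weights = factorized_dense_weights(
    x, random_matrix(d, a, rng), random_matrix(d, b, rng)
)
```

Under these assumptions the two projections hold d x (a + b) parameters instead of the d x N needed by W_2 in the unfactorized dense variant.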

4) Factorized Random Synthesizer

a = R_1 R_2^T \ \ \ where\ R_1\in R^{N\times k} , \ R_2\in R^{N\times k}
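A short sketch of the low-rank construction, again with a row-wise softmax assumed and with hypothetical helper names and toy sizes:

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def factorized_random_weights(r1, r2):
    """Form R_1 R_2^T (rank at most k) and normalise each row."""
    r2_t = [list(col) for col in zip(*r2)]   # k x N
    return softmax_rows(matmul(r1, r2_t))    # N x N


rng = random.Random(0)
N, k = 6, 2  # toy sizes chosen only for illustration
r1 = random_matrix(N, k, rng)
r2 = random_matrix(N, k, rng)
weights = factorized_random_weights(r1, r2)

full_params = N * N          # parameters in an unfactorized R
factored_params = 2 * N * k  # parameters in R_1 and R_2 together
```

The saving is modest at this toy size (24 versus 36) but grows with N, since the factorized count is linear in the sequence length.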

5) Mixture of Synthesizer

a = \alpha_1S_1+\alpha_2S_2+...+\alpha_NS_N
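The mixture can be sketched as a weighted sum of the unnormalised score matrices from several synthesizers. The details here are assumptions: the mixing coefficients \alpha are obtained by a softmax over hypothetical learnable scalars (`mix_scores`) so that they sum to 1, and a row-wise softmax is applied after mixing.

```python
import math
import random


def softmax(scores):
    """Softmax over a flat list of scalars."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix; stands in for synthesizer scores."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def mixture_weights(score_matrices, mix_scores):
    """Combine several N x N score matrices with coefficients that sum to 1."""
    alphas = softmax(mix_scores)
    n = len(score_matrices[0])
    mixed = [
        [sum(al * s[i][j] for al, s in zip(alphas, score_matrices)) for j in range(n)]
        for i in range(n)
    ]
    return alphas, [softmax(row) for row in mixed]  # normalise each row


rng = random.Random(0)
N = 4  # toy sequence length
s_random = random_matrix(N, N, rng)  # stand-in for Random Synthesizer scores
s_dense = random_matrix(N, N, rng)   # stand-in for Dense Synthesizer scores
alphas, weights = mixture_weights([s_random, s_dense], mix_scores=[0.0, 0.0])
```

With equal `mix_scores` both components get a coefficient of 0.5. In training, the scalars would be learned together with the rest of the model.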

Summary

Experiment

The synthesizers are compared against the vanilla Transformer on four kinds of tasks.

Machine Translation (EnDe, EnFr)

Language Model (LM1B)

Text Generation (Summarization, dialogue)

Multi-task NLP (GLUE, SuperGLUE)

Machine Translation & Language Modeling

For NMT, using a synthesizer does not cause a large drop in performance, and combining R+D (Random + Dense) outperforms the standard Transformer.
For LM, the trend is largely the same.

Text Generation

For summarization, using the Random or Dense synthesizer lowers performance, and the best result comes from combining them with vanilla attention.
For dialogue, the Random and Dense synthesizers perform better. Unlike the other tasks, using vanilla attention lowers performance here.

Multi-task NLP

On most tasks, the Random and Dense synthesizers score lower. The presumed reason is that T5's attention behaves like cross-sentence attention, which the Random and Dense synthesizers cannot match.
On easier tasks such as SST (sentiment analysis), the Random and Dense synthesizers perform comparably.
Random + vanilla attention achieves the best performance on most tasks.

Qualitative Analysis

Attention weight histogram

The weights, however, seem smoother and less coarse as compared to the Transformer. This seems to reflect what we expect since the Synthesizer does not benefit from token specific information.
