On the diversity of multi-head attention - Citegraph

Paper Info

Title
On the diversity of multi-head attention

Abstract
Multi-head attention is appealing for its ability to jointly attend to information from different representation subspaces at different positions. In this work, we propose two complementary approaches to better exploit this diversity. First, we introduce disagreement regularization to explicitly encourage diversity among the attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to differ from those of the other heads. Second, we propose to better capture the diverse information distributed across the extracted partial representations with the routing-by-agreement algorithm. The routing algorithm iteratively updates the proportion of a part (i.e. the distinct information learned from a specific subspace) that should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on machine translation, sentence encoding, and logical inference tasks demonstrate the effectiveness and universality of the proposed approaches, indicating the necessity of better exploiting diversity in multi-head attention. While the two strategies individually boost performance, combining them improves model performance further.
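
The abstract names two mechanisms without giving formulas, so the following is only a minimal sketch of how they might look, not the authors' implementation. It assumes that output disagreement is the negative mean pairwise cosine similarity between head outputs, and that routing-by-agreement follows the capsule-network style of dynamic routing (softmax over routing logits, a squashing non-linearity, dot-product agreement). The function names, the `votes[i][j]` layout, and the default of three iterations are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def output_disagreement(heads):
    """Negative mean pairwise cosine similarity between head outputs.

    Higher values mean the heads differ more. Assumption: the mean runs
    over all ordered pairs, including each head paired with itself.
    """
    n = len(heads)
    total = sum(cosine(hi, hj) for hi in heads for hj in heads)
    return -total / (n * n)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def squash(v):
    """Capsule-style squashing: keeps direction, maps the norm into [0, 1)."""
    sq = sum(a * a for a in v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * a for a in v]


def route(votes, iterations=3):
    """Iterative routing from parts to wholes.

    votes[i][j] is the vector that part i proposes for whole j.
    Returns (wholes, coupling), where coupling[i][j] is the proportion
    of part i assigned to whole j in the last iteration.
    """
    n_parts, n_wholes, dim = len(votes), len(votes[0]), len(votes[0][0])
    logits = [[0.0] * n_wholes for _ in range(n_parts)]
    for _ in range(iterations):
        # Each part distributes itself over the wholes.
        coupling = [softmax(row) for row in logits]
        wholes = []
        for j in range(n_wholes):
            weighted = [
                sum(coupling[i][j] * votes[i][j][d] for i in range(n_parts))
                for d in range(dim)
            ]
            wholes.append(squash(weighted))
        # Raise the logit where a part's vote agrees with the whole.
        for i in range(n_parts):
            for j in range(n_wholes):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], wholes[j]))
    return wholes, coupling
```

In this sketch, a regularized training objective would add a weighted `output_disagreement` term to the task loss, and `route` would replace the usual concatenation of head outputs; both of these integration details are assumptions rather than statements from the abstract.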

Year	2021
DOI	10.1016/j.neucom.2021.04.038
Venue	Neurocomputing
Keywords	Natural language processing, Multi-head attention, Diversity, Routing-by-agreement, Neural machine translation, Sentence encoding
DocType	Journal
Volume	454
ISSN	0925-2312
Citations	1
PageRank	0.40
References	0
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Jian Li	1	162	44.60
Xing Wang	2	58	10.07
Zhaopeng Tu	3	518	39.95
Michael R. Lyu	4	10985	529.03
