Abstract
---
We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspiration from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
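
As a rough illustration of the kind of bound the abstract refers to, the InfoNCE contrastive objective from computer vision lower-bounds the mutual information between two views $A$ and $B$ of a sentence (e.g., a global sentence representation and an n-gram drawn from it). This is a sketch with notation introduced here ($f_\theta$, $N$, $\mathcal{B}$), not necessarily the paper's exact estimator:

$$
I(A; B) \;\geq\; \log N + \mathbb{E}\left[\log \frac{f_\theta(a, b)}{\sum_{b' \in \mathcal{B}} f_\theta(a, b')}\right],
$$

where $f_\theta$ is a learned score function and $\mathcal{B}$ contains the positive sample $b$ together with $N - 1$ negatives. Maximizing the right-hand side (equivalently, minimizing the contrastive loss) maximizes a lower bound on $I(A; B)$; under this reading, Skip-gram's word-context scoring and BERT-style masked prediction correspond to different choices of the two views, which is the unification the abstract describes.
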
Year | Venue | DocType |
---|---|---|
2020 | ICLR | Conference |

Citations | PageRank | References
---|---|---
0 | 0.34 | 17

Authors
---
6

Name | Order | Citations | PageRank |
---|---|---|---|
Lingpeng Kong | 1 | 239 | 17.09 |
Cyprien de Masson d'Autume | 2 | 8 | 2.14 |
Lei Yu | 3 | 220 | 11.55 |
Ling Wang | 4 | 884 | 52.37 |
Zihang Dai | 5 | 171 | 12.81 |
Dani Yogatama | 6 | 855 | 42.43 |