Abstract
---
Large Transformer models routinely achieve state-of-the-art results on
a number of tasks, but training these models can be prohibitively costly,
especially on long sequences. We introduce two techniques to improve
the efficiency of Transformers. For one, we replace dot-product attention
by one that uses locality-sensitive hashing, changing its complexity
from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence.
Furthermore, we use reversible residual layers instead of the standard
residuals, which allows storing activations only once in the training
process instead of $N$ times, where $N$ is the number of layers.
The resulting model, the Reformer, performs on par with Transformer models
while being much more memory-efficient and much faster on long sequences.
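
To make the locality-sensitive-hashing idea concrete, here is a minimal sketch of angular LSH bucketing of the kind the paper builds on: vectors are projected with a shared random matrix and hashed to the index of their largest signed projection, so vectors with high cosine similarity tend to land in the same bucket, and attention can then be restricted to within-bucket pairs. The function and variable names below are illustrative, not taken from the paper's code.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, rng):
    """Assign each vector to one of `n_buckets` angular-LSH buckets.

    Each vector x is projected with a random matrix R and hashed to
    argmax over the concatenation [xR; -xR]; similar vectors are
    likely to receive the same bucket id.
    """
    assert n_buckets % 2 == 0, "need an even number of buckets"
    d = vectors.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = vectors @ R  # shape: (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

# Toy usage: bucket a sequence of 16 eight-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
print(lsh_buckets(x, n_buckets=4, rng=rng))  # bucket ids in [0, 4)
```

Restricting attention to within-bucket pairs (after sorting by bucket and chunking) is what reduces the cost from quadratic to roughly $O(L \log L)$.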
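Likewise, the reversible-residual point can be illustrated with a small sketch (not the authors' implementation): each block computes $y_1 = x_1 + F(x_2)$, $y_2 = x_2 + G(y_1)$, so its inputs can be recomputed exactly from its outputs and intermediate activations need not be stored for backpropagation.

```python
import numpy as np

def rev_block_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_block_inverse(y1, y2, f, g):
    """Recover the block's inputs from its outputs, so activations
    can be recomputed on the backward pass instead of stored."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Toy check with fixed random sub-layers standing in for
# attention (f) and feed-forward (g).
rng = np.random.default_rng(0)
Wf, Wg = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
f = lambda h: np.tanh(h @ Wf)
g = lambda h: np.maximum(0.0, h @ Wg)

x1, x2 = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
y1, y2 = rev_block_forward(x1, x2, f, g)
r1, r2 = rev_block_inverse(y1, y2, f, g)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```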
Year | Venue | Keywords
---|---|---
2020 | ICLR | attention, locality sensitive hashing, reversible layers

DocType | Citations | PageRank
---|---|---
Conference | 4 | 0.39

References | Authors
---|---
10 | 3

Name | Order | Citations | PageRank |
---|---|---|---|
Nikita Kitaev | 1 | 4 | 0.39 |
Łukasz Kaiser | 2 | 2307 | 89.08 |
Anselm Levskaya | 3 | 4 | 0.39 |