Abstract
---
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
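The following is an illustrative sketch, not code from the paper or its released models: it contrasts the length of a byte-level input (one ID per UTF-8 byte) with a rough word-level proxy for a tokenizer's output, which is the cost the abstract refers to when noting that byte sequences are longer than token sequences. The +3 ID offset for special tokens is an assumption loosely following common byte-vocabulary conventions, not necessarily the exact scheme used by the released models.

```python
# Sketch: why byte-level inputs are longer than token-level inputs.
text = "Token-free models operate on raw bytes."

# Byte-level "tokenization": each UTF-8 byte becomes one input ID.
# The +3 offset (reserving IDs for pad/eos/unk) is an assumption for illustration.
byte_ids = [b + 3 for b in text.encode("utf-8")]

# A rough proxy for the length a word/subword tokenizer might produce.
word_tokens = text.split()

print(len(byte_ids), len(word_tokens))  # the byte sequence is several times longer
```

Because every character contributes at least one byte (and non-ASCII characters contribute several), sequence length grows by a constant factor relative to subword tokenization, which is the overhead the paper's architectural trade-off analysis addresses.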
Year | DOI | Venue
---|---|---
2022 | 10.1162/TACL_A_00461 | Transactions of the Association for Computational Linguistics

DocType | Volume | Citations
---|---|---
Journal | 10 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 8
Name | Order | Citations | PageRank |
---|---|---|---|
Linting Xue | 1 | 0 | 0.68 |
Aditya Barua | 2 | 0 | 0.68 |
Noah Constant | 3 | 0 | 0.68 |
Rami Al-Rfou' | 4 | 1531 | 49.60 |
Sharan Narang | 5 | 335 | 14.44 |
Mihir Kale | 6 | 0 | 2.03 |
Adam Roberts | 7 | 15 | 4.32 |
Colin Raffel | 8 | 190 | 21.50 |