Do Transformer Modifications Transfer Across Implementations and Applications? | 0 | 0.34 | 2021 |
Searching for Efficient Transformers for Language Modeling. | 0 | 0.34 | 2021 |
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | 0 | 0.34 | 2021 |
How Much Knowledge Can You Pack Into the Parameters of a Language Model? | 0 | 0.34 | 2020 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 2 | 0.46 | 2020 |
Corpora Generation for Grammatical Error Correction. | 0 | 0.34 | 2019 |
Music Transformer: Generating Music with Long-Term Structure. | 0 | 0.34 | 2019 |
Music Transformer - Generating Music with Long-Term Structure. | 4 | 0.42 | 2019 |
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. | 7 | 0.46 | 2018 |
An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation. | 2 | 0.41 | 2018 |
Generating Wikipedia by Summarizing Long Sequences. | 18 | 0.62 | 2018 |
Blockwise Parallel Decoding for Deep Autoregressive Models. | 3 | 0.37 | 2018 |
Weakly Supervised Grammatical Error Correction using Iterative Decoding. | 0 | 0.34 | 2018 |
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. | 2 | 0.46 | 2018 |
Generating Wikipedia by Summarizing Long Sequences. | 0 | 0.34 | 2018 |
Tensor2Tensor for Neural Machine Translation. | 19 | 0.70 | 2018 |
Image Transformer. | 0 | 0.34 | 2018 |
Fast Decoding in Sequence Models using Discrete Latent Variables. | 11 | 0.57 | 2018 |
Mesh-TensorFlow: Deep Learning for Supercomputers. | 4 | 0.40 | 2018 |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. | 95 | 3.02 | 2017 |
Attention Is All You Need. | 432 | 6.52 | 2017 |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. | 0 | 0.34 | 2017 |
One Model To Learn Them All. | 29 | 0.76 | 2017 |
Exploring the Limits of Language Modeling. | 153 | 5.35 | 2016 |
Sparse Non-negative Matrix Language Modeling. | 1 | 0.36 | 2016 |
Swivel: Improving Embeddings by Noticing What's Missing. | 11 | 0.55 | 2016 |
Sparse non-negative matrix language modeling for geo-annotated query session data | 1 | 0.37 | 2015 |
Pruning Sparse Non-Negative Matrix N-Gram Language Models | 1 | 0.37 | 2015 |
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks | 208 | 6.26 | 2015 |
Sparse Non-Negative Matrix Language Modeling For Skip-Grams | 7 | 0.51 | 2015 |
Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation. | 4 | 0.47 | 2014 |
Variational Program Inference | 1 | 0.42 | 2010 |
A probabilistic approach to solving crossword puzzles | 29 | 2.12 | 2002 |
Solving Crosswords with PROVERB | 4 | 1.60 | 1999 |
Solving crossword puzzles as probabilistic constraint satisfaction | 16 | 2.48 | 1999 |
PROVERB: The Probabilistic Cruciverbalist | 25 | 3.97 | 1999 |