Language modeling
Language modeling is the task of predicting the next word or character in a document.
* Indicates models using dynamic evaluation.
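As a concrete (toy) illustration of next-word prediction, the sketch below estimates P(next word | current word) from bigram counts over a made-up corpus; the corpus and function name are purely illustrative and unrelated to the models listed below.

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model is trained on far more text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams to estimate P(next word | current word).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(word):
    """Return P(next | word) as a dict, estimated from bigram counts."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("the"))
# {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```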
Word Level Models
Penn Treebank
A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by Mikolov et al. (2010).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with N, newlines were replaced with <eos>,
and all other punctuation was removed. The vocabulary consists of
the 10k most frequent words, with the remaining tokens replaced by an <unk>
token.
Models are evaluated based on perplexity, the exponential of the average
negative per-word log-probability (lower is better).
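For reference, the sketch below shows how perplexity follows from per-word log-probabilities; the probability values are made up for illustration.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(average negative log-probability per word).

    `log_probs` holds the natural-log probability the model assigned
    to each word in the evaluation set.
    """
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)

# Hypothetical per-word log-probabilities for a 4-word test sequence.
example_log_probs = [math.log(p) for p in (0.2, 0.05, 0.1, 0.25)]
print(perplexity(example_log_probs))  # ~8.0 for these made-up values
```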
Model | Validation perplexity | Test perplexity | Paper / Source | Code |
---|---|---|---|---|
AWD-LSTM-MoS + dynamic eval* by Yang et al. (2018) | 48.33 | 47.69 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | |
AWD-LSTM + dynamic eval* by Krause et al. (2017) | 51.6 | 51.1 | Dynamic Evaluation of Neural Sequence Models | |
AWD-LSTM + continuous cache pointer* by Merity et al. (2017) | 53.9 | 52.8 | Regularizing and Optimizing LSTM Language Models | |
AWD-LSTM-MoS by Yang et al. (2018) | 56.54 | 54.44 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | |
AWD-LSTM by Merity et al. (2017) | 60.0 | 57.3 | Regularizing and Optimizing LSTM Language Models | |
WikiText-2
WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. It consists of around 2 million words extracted from Wikipedia articles.
Model | Validation perplexity | Test perplexity | Paper / Source | Code |
---|---|---|---|---|
AWD-LSTM-MoS + dynamic eval* by Yang et al. (2018) | 42.41 | 40.68 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | |
AWD-LSTM + dynamic eval* by Krause et al. (2017) | 46.4 | 44.3 | Dynamic Evaluation of Neural Sequence Models | |
AWD-LSTM + continuous cache pointer* by Merity et al. (2017) | 53.8 | 52.0 | Regularizing and Optimizing LSTM Language Models | |
AWD-LSTM-MoS by Yang et al. (2018) | 63.88 | 61.45 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | |
AWD-LSTM by Merity et al. (2017) | 68.6 | 65.8 | Regularizing and Optimizing LSTM Language Models | |
Character Level Models
Hutter Prize
The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
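The tables below report bits per character (BPC), i.e. the average negative log2-probability per byte. The sketch below shows the conversion from natural-log probabilities to BPC, plus a quick check of the unique-byte count, assuming a local copy of the enwik8 file (the file path is an assumption).

```python
import math

def bits_per_character(log_probs):
    """BPC = average negative log-probability per character, in bits.

    `log_probs` are natural-log probabilities the model assigned to each
    byte/character; dividing by ln(2) converts nats to bits.
    """
    avg_nll_nats = -sum(log_probs) / len(log_probs)
    return avg_nll_nats / math.log(2)

# Counting unique byte values in a local copy of enwik8 (file path assumed)
# should reproduce the 205 unique tokens mentioned above.
with open("enwik8", "rb") as f:
    data = f.read(100_000_000)
print(len(set(data)))
```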
Model | Bits per Character (BPC) | Number of params (M) | Paper / Source | Code |
---|---|---|---|---|
AWD-LSTM + dynamic eval* by Krause et al. (2017) | 1.08 | 46 | Dynamic Evaluation of Neural Sequence Models | |
3 layer AWD-LSTM by Merity et al. (2018) | 1.232 | 47 | An Analysis of Neural Language Modeling at Multiple Scales | |
Large FS-LSTM-4 by Mujika et al. (2017) | 1.245 | 47 | Fast-Slow Recurrent Neural Networks | |
Large mLSTM +emb +WN +VD by Krause et al. (2016) | 1.24 | 46 | Multiplicative LSTM for sequence modelling | |
FS-LSTM-4 by Mujika et al. (2017) | 1.277 | 27 | Fast-Slow Recurrent Neural Networks | |
Large RHN by Zilly et al. (2016) | 1.27 | 46 | Recurrent Highway Networks | |
Text8
The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lower-cased so that it contains only the 26 letters of the English alphabet plus spaces.
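As a hedged sketch, the filtering below approximates this style of cleaning in Python; the real dataset was produced with a more elaborate script (which, for example, also spells out digits), so this is illustrative rather than a faithful reproduction.

```python
import re

def text8_style_clean(raw):
    """Rough approximation of text8-style filtering: keep only
    lowercase a-z separated by single spaces.
    """
    text = raw.lower()
    text = re.sub(r"<[^>]*>", " ", text)      # drop XML/HTML tags
    text = re.sub(r"[^a-z]+", " ", text)      # keep letters only
    return re.sub(r" +", " ", text).strip()   # collapse whitespace

print(text8_style_clean('<doc id="1">Hello, World 2018!</doc>'))
# -> "hello world"
```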
Model | Bits per Character (BPC) | Number of params (M) | Paper / Source | Code |
---|---|---|---|---|
AWD-LSTM + dynamic eval* by Krause et al. (2017) | 1.19 | 45 | Dynamic Evaluation of Neural Sequence Models | |
Large mLSTM +emb +WN +VD by Krause et al. (2016) | 1.27 | 45 | Multiplicative LSTM for sequence modelling | |
Large RHN by Zilly et al. (2016) | 1.27 | 46 | Recurrent Highway Networks | |
LayerNorm HM-LSTM by Chung et al. (2017) | 1.29 | 35 | Hierarchical Multiscale Recurrent Neural Networks | |
BN LSTM by Cooijmans et al. (2016) | 1.36 | 16 | Recurrent Batch Normalization | |
Unregularised mLSTM by Krause et al. (2016) | 1.4 | 45 | Multiplicative LSTM for sequence modelling | |
Penn Treebank
The vocabulary of this character-level dataset is limited to the same 10,000 words used in the word-level dataset. This vastly simplifies character-level language modeling, since character transitions are limited to those found within that restricted word-level vocabulary.
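As a hedged sketch of how the word-level corpus can be turned into a character stream, the helper below splits each pre-processed word into characters and marks word boundaries with an explicit symbol; the `_` boundary marker and function name are illustrative conventions, not necessarily those of any particular release of the dataset.

```python
def to_char_sequence(line, boundary="_"):
    """Turn one pre-processed PTB line into a character sequence.

    Words arrive already lower-cased, with numbers as N and rare words as
    <unk>, so the characters and transitions a model sees are restricted
    to those occurring in the 10k word-level vocabulary.
    """
    chars = []
    for word in line.split():
        chars.extend(word)        # individual characters of the word
        chars.append(boundary)    # explicit word-boundary symbol
    return chars

print(to_char_sequence("the <unk> bought N shares"))
```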
Model | Bits per Character (BPC) | Number of params (M) | Paper / Source | Code |
---|---|---|---|---|
3 layer AWD-LSTM by Merity et al. (2018) | 1.175 | 13.8 | An Analysis of Neural Language Modeling at Multiple Scales | |
6 layer QRNN by Merity et al. (2018) | 1.187 | 13.8 | An Analysis of Neural Language Modeling at Multiple Scales | |
FS-LSTM-4 by Mujika et al. (2017) | 1.19 | 27.0 | Fast-Slow Recurrent Neural Networks | |
FS-LSTM-2 by Mujika et al. (2017) | 1.193 | 27.0 | Fast-Slow Recurrent Neural Networks | |
NASCell by Zoph & Le (2016) | 1.214 | 16.3 | Neural Architecture Search with Reinforcement Learning | |
2-Layer Norm HyperLSTM by Ha et al. (2016) | 1.219 | 14.4 | HyperNetworks | |