Natural Language Processing

about this course

Machines have had some ability to understand and generate text for several decades. More recently, neural approaches have revolutionized the field. The most striking and impressive models are large Transformer-based models trained on large amounts of text.

But if "Attention is All You Need," then maybe we don't need those large language models at all. If "Language Models are Few-Shot Learners," then maybe small language models can learn from a few examples too. Maybe we can develop a small language model that better understands our language if we "pay attention" to the language.

Training a model to examine the links between words in a sequence and directly model those relationships trains the model to understand human language. Super-sizing a model super-sizes the cost. It does not super-size the understanding.

Larger models tend to perform better than smaller models, but performance gains diminish as model size increases.

At large sizes, language models can become superficially fluent without truly understanding human language. Such "Stochastic Parrots" simply repeat long sequences that they learned from the training data. So we need to "pay attention" to those datasets and ask what the model has learned.

Language models perform better in the domain that they have been trained on. A model trained on Wikipedia will not understand Huckleberry Finn. So we also need to ask if there is good reason to believe that a large language model can be fine-tuned for a given task. Many times there will be a good reason. Sometimes there will not.

And when the language is not English, a large language model that can be fine-tuned for any task might not even exist at all. In those cases, we need to "pay attention" to the language, so that we can train a small language model that understands our (non-English) language.

what you will learn

This course will compare the performance of RNNs, Transformers, BERT and GPT to previous approaches. And it will pay particular attention to how those performance gains were achieved. Did the researchers develop a better model? Or did they train a larger model?

For example, the fluency and translation quality of neural translation models far surpasses that of phrase-based statistical models. And in low-resource cases too.

But what's important is how those performance gains were achieved. Instead of translating words or phrases, the neural approach attempts to understand context. Neural models translate better than phrase-based models because they attempt to create a sentence in the target language with the same meaning as the source language sentence.

In that spirit, this course will explore neural approaches to natural language processing. Comparing them, it will ask how we can develop models that better understand our language.

By training small comparably-sized models, we can compare approaches. Holding model size constant, we'll ask which training or fine-tuning technique performs best on a given task. Identifying the techniques that work well at small scale, we'll find techniques that work exceptionally well at large scale.

links and files

course syllabus

models, corpora and software

the Transformer
multilingual translation
GPT models
BERT models
"Stochastic Parrots"

course outline

lecture 00 — context and background

themes:

Before neural networks were used in NLP, count-based methods like term-frequency, inverse document frequency (TF-IDF) provided classifiers. And phrase-based methods provided statistical machine translation.
More recently, deep learning has given NLP simple, but sophisticated models to understand and generate human language. And at large scale, the new models provide impressive, but superficial fluency.
This lecture explores the beautiful potential to help people learn a new language or understand someone else's language. And it also explores the dangerous potential to generate environmental pollution and bad information.

readings:

Bender and Gebru et al (2021). "On the Dangers of Stochastic Parrots"
Koehn and Knowles (2017). "Six Challenges for Neural Machine Translation"
Sennrich and Zhang (2019). "Revisiting Low-Resource NMT"

lecture 01 — tools for our "NLP kitchen"

theme:

In linear regression, the confidence interval around the prediction is smaller when the predictor variables lie close to the sample means in the dataset used to estimate the regression model.
Similarly, language models make better predictions when the subword distribution at inference matches the subword distribution at training.
Incorporating linguistic theory into the pre-processing aligns those distributions.

tools we can use:

regular expressions
dictionaries and grammar books
lemmas, parts of speech and dependency labels

readings:

Sennrich and Haddow (2016). "Linguistic Input Features Improve NMT"
Oncevay et al (2022). "Revisiting Syllables"
Wdowiak (2022). "A Recipe for Low-Resource NMT"

lecture 02 — word embeddings

theme:

Humans understand a word or phrase by understanding the context in the word or phrase appears. Similarly, we can train a machine to understand words and phrases by training it to understand the contexts in which those words or phrases appear.
So our first language understanding task is to convert words and phrases into vector representations of their context. Then, from those representations, we can measure the similarity between words and phrases. Words and phrases that are close to each other in vector space should have similar meanings.

readings:

Mikolov et al (2013). "Efficient Estimation of Word Representations in Vector Space"
Mikolov et al (2013). "Distributed Representations of Words and Phrases"
Pennington et al (2014). "GloVe: Global Vectors for Word Representation"

lecture 03 — subword segmentation

theme:

Newer segmentation methods provide language-independent tokenization, detokenization and segmentation from raw sentences, whereas previous methods assumed prior tokenization.
This choice is important because once a model has been trained, the subword vocabulary is "locked in." It's also important because some languages do not divide sentences into words.
And it's important because a language model predicts a sequence of subword units. Incorporating theory into the subword-splitting trains the model to predict a theoretic sequence of subword units.

readings:

Sennrich, Haddow and Birch (2015). "NMT of Rare Words with Subword Units"
Kudo and Richardson (2018). "SentencePiece"

lecture 04 — recurrent neural networks

themes:

RNNs were the first neural model used in machine translation. Reading words sequentially, they employ gating mechanisms to identify relationships between separated words in a sequence. The gating mechanisms use word embeddings to understand context.
The output that RNNs produced was far more fluent than that of phrase-based models. Nonetheless, phrase-based models continued to outperform RNNs in low-resource cases. Only when trained on very large datasets, RNN models outperformed phrase-based models.
RNNs fell out of favor because recurrent processing requires long training times. And the gating mechanisms only partially solved the vanishing gradient problem, which often made RNNs difficult to train.

readings:

Bahdanau, Cho and Bengio (2014). "NMT by Jointly Learning to Align and Translate"
Wu et al (2016). "Google's NMT System"
Neubig (2017). "NMT and Sequence-to-Sequence Models: a Tutorial"

lecture 05 — the Transformer

themes:

The Transformer quickly became the neural model of choice for NLP tasks.
Unlike RNNs, Transformers do not require recurrent processing of a hidden state. Instead, they encode and decode using only self-attention, performing computations in parallel, which reduces training costs.
Self-attention directly models the relationships between words in a sequence as it examines the links between them. So when training a Transformer, computations are spent modeling the language (not encoding/decoding a hidden state vector).
It's a simpler approach that scales upward to larger datasets. And it also scales downward to smaller datasets, enabling us to develop useful, meaningful models in low-resource cases.

readings:

Vaswani et al (2017). "Attention is All You Need"
Harvard NLP (2018, 2022). "The Annotated Transformer"

lecture 06 — multilingual translation

theme:

Adding a directional token to the input sequence enables multilingual translation and improves translation quality for low-resource pairs, an example of transfer-learning.
This improvement comes at the expense of quality among high-resource pairs when model size is held constant, so researchers began training very large models to improve translation quality for all included languages.
One efficient way to employ transfer-learning is to use linguistic theory when constructing the training data. As one example, Facebook successfully trained a 100-language model with a strategy that considered the relationships among languages.
Their strategy makes extensive use of back-translation, which synthetically creates source language text from monolingual text (which then serves as target language text). So this lecture will also explore back-translation, a form of semi-supervised learning.

readings:

Johnson et al (2016). "Google's Multilingual NMT: Enabling Zero-Shot Translation"
Fan et al (2020). "Beyond English-Centric Multilingual Machine Translation"
Arivazhagan et al (2019). "Massively Multilingual NMT in the Wild"
Zhang et al (2020). "Improving Massively Multilingual NMT and Zero-Shot Translation"
Kudugunta et al (2019). "Investigating Multilingual NMT Representations at Scale"
Sennrich, Haddow and Birch (2015). "Improving NMT Models with Monolingual Data"

lecture 07 — GPT models

themes:

Using a decoder-only Transformer, researchers at OpenAI observed that pre-training a language model in an unsupervised fashion and then fine-tuning it for a given task is an effective strategy.
In their 2018 paper introducing GPT, they hypothesized that "the more structured attentional memory of the transformer assists in transfer compared to LSTMs."
Then in 2019 and 2020, they trained models at a range of sizes and observed that a model's ability to transfer learning across tasks increases with model size. Their largest model performed best on all tests. And sometimes it performed much better than the smaller models. But in many cases, the improvement was small.

readings:

Radford et al (2018). "Improving Language Understanding by Generative Pre-Training"
Radford et al (2019). "Language Models are Unsupervised Multitask Learners"
Brown et al (2020). "Language Models are Few-Shot Learners"

lecture 08 — BERT models

themes:

Using an encoder-only Transformer, researchers at Google took a different approach to pre-training and unsupervised learning. They developed a "bidirectional" model, BERT, which allows each token to attend to all tokens in the self-attention layers, so that the representation learns from context on both sides.
For comparison, GPT (like the original Transformer) only allows a token to attend to previous tokens.
So to prevent trivial predictions, the team that developed BERT introduced masked language modeling. During pre-training, they randomly replaced a fraction of the tokens with either a "mask" token, a random token or the same (unchanged) token, then they pre-trained BERT to predict those tokens.
And to capture the relationship between two sentences, they created paired training examples in which the second sentence was either the next sentence or a random sentence.
The bidirectional cross-attention between the two sentences makes BERT a good choice for question-answering tasks, inference tasks and classification tasks.

readings:

Devlin et al (2019). "BERT: Pre-Training of Deep Bidirectional Transformers"
Pires et al (2019). "How Multilingual is Multilingual BERT?"
Lewis et al (2019). "BART: Denoising Sequence-to-Sequence Pre-Training"
Sanh et al (2020). "DistilBERT: a Distilled Version of BERT"

lecture 09 — "Stochastic Parrots"

themes:

Large language models impressively generate coherent, fluent text by predicting a sequence of subword units. This concluding lecture will consider optimal model size and what language models truly understand.
Large models perform better, but they also cost more to train and deploy. So given the diminishing marginal returns to model capacity, the profit-maximizing model size may be quite small.
Training and fine-tuning large language models consumes energy. And often times that energy does not come from renewable sources.
Training models on mostly-English datasets leaves other languages poorly served. And even within English, the data may not reflect the way the language has changed and is changing in response to changing social views, changing opinions.
But with careful thought and planning, we can develop language models that understand the language that we speak.

readings:

Bender and Gebru et al (2021). "On the Dangers of Stochastic Parrots"