Natural Language Processing
models, corpora and software
- lecture 00 — context and background
- Before neural networks were used in NLP, count-based methods like
term frequency–inverse document frequency (TF-IDF) provided the features for text classifiers.
And phrase-based methods provided statistical machine translation.
- More recently, deep learning has given NLP simple, but sophisticated models
to understand and generate human language. And at large scale, the new models
provide impressive, but superficial fluency.
- This lecture explores the beautiful potential to help people learn a new language
or understand someone else's language. And it also explores the dangerous potential
to generate environmental pollution and misinformation.
- lecture 01 — tools for our "NLP kitchen"
- In linear regression, the confidence interval around the prediction
is smaller when the predictor variables lie close to the sample means
in the dataset used to estimate the regression model (a formula below makes this precise).
- Similarly, language models make better predictions when the subword distribution
at inference matches the subword distribution at training.
- Incorporating linguistic theory into the pre-processing aligns those distributions.
- tools we can use:
- regular expressions
- dictionaries and grammar books
- lemmas, parts of speech and dependency labels
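- A minimal sketch of those tools in action, using spaCy and its small English pipeline
(an assumption: the lecture does not prescribe a specific library):

```python
# lemmas, parts of speech and dependency labels with spaCy
# (assumes:  pip install spacy  and  python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading grammar books in the library.")

for token in doc:
    # each token's lemma, part of speech and dependency label
    print(f"{token.text:10} {token.lemma_:10} {token.pos_:6} {token.dep_}")
```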
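- And to make the regression analogy precise, a standard result: in simple linear
regression, the standard error of the predicted mean at a new point x₀ is

```latex
\widehat{\mathrm{SE}}(\hat{y}_0)
  = \hat{\sigma}\,\sqrt{ \frac{1}{n}
    + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} }
```

which is smallest when x₀ equals the sample mean x̄. In the same way, a language
model predicts most reliably when its inputs at inference resemble its training data.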
- lecture 02 — word embeddings
- Humans understand a word or phrase by understanding the context in which the word or phrase appears.
Similarly, we can train a machine to understand words and phrases by training it to understand
the contexts in which those words or phrases appear.
- So our first language understanding task is to convert words and phrases
into vector representations of their context. Then, from those representations,
we can measure the similarity between words and phrases.
Words and phrases that are close to each other in vector space should have similar meanings.
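- A minimal sketch of that idea with gensim's Word2Vec, trained on a tiny toy corpus
(real embeddings require far more text; the corpus and sizes here are illustrative):

```python
# learn word vectors from the contexts in which words appear
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "mouse"],
    ["a", "dog", "chased", "a", "ball"],
]

# window=2 defines the "context"; each word is learned from its neighbors
model = Word2Vec(sentences=corpus, vector_size=32, window=2, min_count=1, epochs=50)

# words that appear in similar contexts land close together in vector space
print(model.wv.similarity("cat", "dog"))
```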
- lecture 03 — subword segmentation
- Newer segmentation methods provide language-independent
tokenization, detokenization and segmentation from raw sentences,
whereas previous methods assumed prior tokenization.
- This choice is important because once a model has been trained,
the subword vocabulary is "locked in."
It's also important because some languages do not divide sentences into words.
- And it's important because a language model predicts a sequence of subword units.
Incorporating linguistic theory into the subword splitting trains the model
to predict a linguistically meaningful sequence of subword units.
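- A minimal sketch with SentencePiece, which trains directly on raw sentences
(the file names here are hypothetical):

```python
# language-independent subword segmentation with SentencePiece
import sentencepiece as spm

# train on raw text -- no prior tokenization assumed
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="subword", vocab_size=8000
)

sp = spm.SentencePieceProcessor(model_file="subword.model")

pieces = sp.encode("This vocabulary is locked in after training.", out_type=str)
print(pieces)             # the subword units a language model would predict
print(sp.decode(pieces))  # lossless detokenization back to the raw sentence
```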
- lecture 04 — recurrent neural networks
- RNNs were the first neural models used in machine translation.
Reading a sequence of word embeddings one step at a time, they employ gating mechanisms
to carry context across the sequence and identify relationships
between distant words (sketched in code below).
- The output that RNNs produced was far more fluent than that of phrase-based models.
Nonetheless, phrase-based models continued to outperform RNNs in low-resource cases.
Only when trained on very large datasets did RNN models
outperform phrase-based models.
- RNNs fell out of favor because recurrent processing requires long training times.
And the gating mechanisms only partially solved the vanishing gradient problem,
which often made RNNs difficult to train.
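- A minimal sketch of a gated RNN reading word embeddings, in PyTorch
(the sizes are arbitrary; the lecture does not prescribe a framework):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, embed_dim)           # word IDs -> vectors
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # gated recurrence

token_ids = torch.randint(0, vocab_size, (1, 7))  # one 7-word sentence
outputs, (h_n, c_n) = lstm(embedding(token_ids))

# one hidden state per word, computed step by step -- hence the long training times
print(outputs.shape)  # torch.Size([1, 7, 128])
```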
- lecture 05 — the Transformer
- The Transformer quickly became the neural model of choice for NLP tasks.
- Unlike RNNs, Transformers do not require recurrent processing of a hidden state.
Instead, they encode and decode using attention alone,
performing computations in parallel, which reduces training costs.
- Self-attention directly models the relationships between the words in a sequence
by computing attention weights between every pair of positions (sketched below).
So when training a Transformer, computations are spent modeling the language
(not encoding/decoding a hidden state vector).
- It's a simpler approach that scales upward to larger datasets.
And it also scales downward to smaller datasets, enabling us to develop
useful, meaningful models in low-resource cases.
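- A minimal sketch of (single-head) scaled dot-product self-attention; a real
Transformer also projects the inputs into queries, keys and values:

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    """Every position attends to every position -- all pairs, in parallel."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5  # pairwise link strengths
    weights = F.softmax(scores, dim=-1)          # attention weights
    return weights @ x                           # context-mixed representations

x = torch.randn(7, 64)          # 7 tokens, 64-dimensional embeddings
print(self_attention(x).shape)  # torch.Size([7, 64])
```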
- lecture 06 — multilingual translation
- Adding a token that indicates the translation direction to the input sequence enables multilingual
translation and improves translation quality for low-resource pairs, an example of transfer learning (sketched below).
- This improvement comes at the expense of quality among high-resource pairs when model size
is held constant, so researchers began training very large models to improve translation quality
for all included languages.
- One efficient way to employ transfer learning is to use linguistic theory
when constructing the training data. As one example, Facebook successfully trained
a 100-language model with a strategy that considered the relationships among languages.
- Their strategy makes extensive use of back-translation, which synthetically creates
source-language text from monolingual text (the monolingual text then serves as the target).
So this lecture will also explore back-translation, a form of semi-supervised learning.
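- A minimal sketch of the directional token and of back-translation
(the tags, sentences and reverse model here are hypothetical):

```python
def tag_source(source_sentence, target_lang):
    # prepend a token telling the model which language to translate into
    return f"<2{target_lang}> {source_sentence}"

print(tag_source("hello world", "de"))  # "<2de> hello world"

def back_translate_pair(monolingual_target_sentence, reverse_model):
    # a reverse model creates synthetic source text from monolingual text;
    # the monolingual text then serves as the target side of a new training pair
    synthetic_source = reverse_model(monolingual_target_sentence)
    return synthetic_source, monolingual_target_sentence
```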
- lecture 07 — GPT models
- Using a decoder-only Transformer, researchers at OpenAI observed that
pre-training a language model in an unsupervised fashion and then
fine-tuning it for a given task is an effective strategy (illustrated below).
- In their 2018 paper introducing GPT, they hypothesized that
"the more structured attentional memory of the transformer assists
in transfer compared to LSTMs."
- Then in 2019 and 2020, they trained models at a range of sizes and observed that
a model's ability to transfer learning across tasks increases with model size.
Their largest model performed best on all tests.
And sometimes it performed much better than the smaller models.
But in many cases, the improvement was small.
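- To make the pre-train-then-fine-tune idea concrete, a minimal sketch of text
generation with a pre-trained decoder-only model (GPT-2 via Hugging Face's
transformers library, an assumption; OpenAI's larger models sit behind an API):

```python
from transformers import pipeline

# download a pre-trained GPT-2 and generate a continuation
generator = pipeline("text-generation", model="gpt2")
result = generator("Pre-training a language model", max_new_tokens=20)
print(result[0]["generated_text"])
```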
- lecture 08 — BERT models
- Using an encoder-only Transformer, researchers at Google took a different approach to
pre-training and unsupervised learning. They developed a "bidirectional" model, BERT,
which allows each token to attend to all tokens in the self-attention layers, so that
the representation learns from context on both sides.
- For comparison, GPT (like the original Transformer) only allows a token to attend to previous tokens.
- So to prevent trivial predictions, the team that developed BERT introduced masked language modeling.
During pre-training, they randomly selected 15 percent of the tokens and replaced each one
with either a "mask" token, a random token or the same (unchanged) token,
then they pre-trained BERT to predict those selected tokens (implemented in the sketch below).
- And to capture the relationship between two sentences,
they created paired training examples in which the second sentence
was either the true next sentence or a random sentence (the "next sentence prediction" task).
- The bidirectional cross-attention between the two sentences makes BERT a good choice
for question-answering tasks, inference tasks and classification tasks.
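- A minimal sketch of BERT's masking rule as described above (the 80/10/10 split
follows the BERT paper; the toy vocabulary is hypothetical):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:
            targets.append(tok)                         # model must predict this
            roll = random.random()
            if roll < 0.8:
                corrupted.append("[MASK]")              # 80%: mask token
            elif roll < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                   # 10%: unchanged
        else:
            corrupted.append(tok)
            targets.append(None)                        # not a prediction target
    return corrupted, targets

vocab = ["dog", "book", "runs", "blue"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```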
- lecture 09 — "Stochastic Parrots"
- Large language models impressively generate coherent, fluent text by predicting a
sequence of subword units. This concluding lecture will consider optimal model size
and what language models truly understand.
- Large models perform better, but they also cost more to train and deploy.
So given the diminishing marginal returns to model capacity,
the profit-maximizing model size may be quite small.
- Training and fine-tuning large language models consumes energy.
And oftentimes that energy does not come from renewable sources.
- Training models on mostly-English datasets leaves other languages poorly served.
And even within English, the data may not reflect the way the language has changed,
and is changing, in response to evolving social views and opinions.
- But with careful thought and planning, we can develop language models
that understand the language that we speak.
Copyright © 2002-2023 Eryk Wdowiak