Semi-Supervised Learning for NLP. CS224n lecture 17.
Why use semi-supervised learning for NLP?
Why has deep learning been so successful recently?
- Better “tricks” (dropout, improved optimizers (e.g., Adam), batch norm, attention)
- Better hardware (thanks video games!) -> larger models
- Larger datasets
NLP Dataset Sizes
Semi-Supervised Learning
- Use unlabeled examples during training
- Unlabeled examples are easy to find for NLP!
Semi-supervised learning algorithms
We will cover three semi-supervised learning techniques:
- Pre-training
- One of the tricks that started to make NNs successful
- You learned about this in week 1 (word2vec)!
- Self-training
- One of the oldest and simplest semi-supervised learning algorithms (1960s)
- Consistency regularization
- Recent idea (2014, lots of active research)
- Great results for both computer vision and NLP
Pre-training
- First train an unsupervised model on unlabeled data
- Then incorporate the model’s learned weights into a supervised model and train it on the labeled data
- Optional: continue fine-tuning the unsupervised weights.
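A minimal sketch of this recipe in PyTorch (the split into a shared pre-trained encoder and a supervised-only classifier head is illustrative, not from the lecture):

```python
import torch.nn as nn

# Shared part: an encoder pre-trained with some unsupervised objective on unlabeled text
# (the unsupervised training loop itself is not shown here).
pretrained_encoder = nn.LSTM(input_size=300, hidden_size=512, batch_first=True)

class SentenceClassifier(nn.Module):
    """Supervised model that reuses the pre-trained encoder's weights."""
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder                   # shared, initialized from pre-training
        self.out = nn.Linear(512, num_classes)   # supervised-only part, randomly initialized

    def forward(self, x):
        _, (h, _) = self.encoder(x)              # use the final hidden state
        return self.out(h[-1])

model = SentenceClassifier(pretrained_encoder, num_classes=5)

# Optional step from above: leave requires_grad=True to keep fine-tuning the pre-trained
# weights, or freeze them as shown here.
for p in model.encoder.parameters():
    p.requires_grad = False
```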
Pre-training Example: Word2Vec
- Shared part is word embeddings
- No unsupervised-only part
- Supervised-only part is the rest of the model
- Unsupervised learning: skip-gram / CBOW / GloVe / etc.
- Supervised learning: training on some NLP task
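For example, one common pattern (a sketch assuming gensim for the skip-gram step and PyTorch for the supervised model; `unlabeled_sentences` is a hypothetical list of tokenized sentences):

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Unsupervised step: train skip-gram (sg=1) embeddings on unlabeled text.
w2v = Word2Vec(sentences=unlabeled_sentences, vector_size=100, sg=1, min_count=1)

# Shared part: copy the learned vectors into the supervised model's embedding layer.
emb_matrix = torch.tensor(np.array([w2v.wv[w] for w in w2v.wv.index_to_key]),
                          dtype=torch.float)
embedding = nn.Embedding.from_pretrained(emb_matrix, freeze=False)  # freeze=True to skip fine-tuning

# Supervised-only part: the rest of the model (e.g., an LSTM + softmax for the NLP task)
# sits on top of `embedding` and is trained on the labeled data.
```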
_Why does pre-training work?_
- ”Smart” initialization for the model
- More meaningful representations in the model
- e.g., GloVe vectors already capture a lot about word meaning, so our model no longer has to learn word meanings from scratch
Pre-Training for NLP
Pre-Training Strategies: Auto-Encoder (Dai & Le, 2015)
- For pre-training, train an autoencoder:
- seq2seq model (without attention) where the target sequence is the input sequence
- the encoder converts the input into a vector that contains enough information that the input can be recovered
- Initialize the LSTM for a sentence classification model with the encoder
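Roughly, in code (a simplified sketch, not the paper's exact setup; sizes and names are illustrative, and a real implementation would shift the decoder input by one token):

```python
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Pre-training objective: reconstruct the input sequence from the encoder's final state."""
    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        _, state = self.encoder(self.emb(x))           # compress the input into a vector
        dec_out, _ = self.decoder(self.emb(x), state)  # target sequence = input sequence
        return self.out(dec_out)                       # per-position vocabulary logits

# After pre-training on unlabeled text, the encoder (and embedding) weights are used to
# initialize the LSTM of the sentence classification model.
```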
More:
- CoVe (McCann et al., 2017): Learned in Translation: Contextualized Word Vectors
- TagLM, the precursor to ELMo (Peters et al., 2017): Semi-supervised sequence tagging with bidirectional language models
Self-training
- Use unlabeled data without a giant model or long pretraining phase
- Old (1960s) and simple semi-supervised algorithm
- Algorithm (a code sketch follows below):
  1. Train the model on the labeled data.
  2. Have the model label the unlabeled data.
  3. Take some of the examples the model is most confident about (i.e., the examples it assigns high probability) and add them, with the model's predicted labels, to the training set.
  4. Go back to step 1.
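A minimal sketch of that loop, assuming a scikit-learn-style classifier with `fit`/`predict_proba` and NumPy arrays (the confidence threshold and number of rounds are illustrative):

```python
import numpy as np

def self_training(model, X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    """Classic (offline) self-training with a confidence threshold."""
    X, y = X_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model.fit(X, y)                            # 1. train on the (growing) labeled set
        if len(X_unlabeled) == 0:
            break
        probs = model.predict_proba(X_unlabeled)   # 2. label the unlabeled data
        keep = probs.max(axis=1) >= threshold      # 3. keep only confident predictions
        X = np.concatenate([X, X_unlabeled[keep]])
        y = np.concatenate([y, probs[keep].argmax(axis=1)])  # add them with the model's labels
        X_unlabeled = X_unlabeled[~keep]           # 4. go back to 1 with the rest
    return model
```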
Online Self-Training
Note: convert the model-produced label distribution into a one-hot vector, e.g., [0.1, 0.5, 0.4] => [0, 1, 0]
Hard vs Soft Targets
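Roughly, a hard target is the one-hot argmax of the model's prediction, while a soft target keeps the full predicted distribution. A tiny NumPy illustration:

```python
import numpy as np

pred = np.array([0.1, 0.5, 0.4])   # model's predicted distribution for an unlabeled example

hard_target = np.eye(len(pred))[pred.argmax()]   # [0., 1., 0.] -- one-hot, as in the note above
soft_target = pred                               # keep the full distribution as the target

# Soft targets preserve the model's uncertainty (class 2 is nearly as likely as class 1 here);
# hard targets discard it.
```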
Consistency regularization
- Train the model so that a small amount of noise added to an input does not change its predictions
- Equivalently, the model must give consistent predictions to nearby data points
Example:
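A minimal sketch of one consistency-regularization training step in PyTorch (the generic classifier `model`, Gaussian input noise, and the weighting hyperparameter `lam` are all assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def consistency_step(model, x_labeled, y_labeled, x_unlabeled, lam=1.0, sigma=0.1):
    """One training step: supervised loss + consistency loss on an unlabeled batch."""
    # Supervised part: standard cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    # Consistency part: predictions on clean vs. noisy versions of the same inputs should agree.
    with torch.no_grad():
        clean_probs = F.softmax(model(x_unlabeled), dim=-1)          # "target" prediction
    noisy_logits = model(x_unlabeled + sigma * torch.randn_like(x_unlabeled))
    cons_loss = F.kl_div(F.log_softmax(noisy_logits, dim=-1), clean_probs,
                         reduction="batchmean")

    return sup_loss + lam * cons_loss
```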
Amazing Results for Computer Vision!
How to Apply Consistency Regularization to NLP?
Virtual Adversarial Training (Miyato et al., 2017)
- Apply consistency regularization to text classification
- First embed the words
- Add the noise to the word embeddings
- Have to constrain the word embeddings (e.g., make them have zero mean and unit variance)
- Otherwise the model could just make them have really large magnitude so the noise doesn’t change anything
- Noise added to the word embeddings is not chosen randomly: it is chosen adversarially
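A rough sketch of the adversarial perturbation on the embeddings (simplified to a single power-iteration step; `model` here is a hypothetical function from embeddings to logits, and `xi`, `eps` are illustrative hyperparameters, not the paper's exact values):

```python
import torch
import torch.nn.functional as F

def vat_loss(model, emb, xi=1e-6, eps=2.0):
    """Virtual adversarial loss on a batch of (normalized) word embeddings `emb`."""
    with torch.no_grad():
        p = F.softmax(model(emb), dim=-1)        # predictions on the clean embeddings

    # Start from a random direction and refine it with one gradient step so it points
    # in the direction that changes the prediction the most (the adversarial direction).
    d = xi * F.normalize(torch.randn_like(emb).flatten(1), dim=1).view_as(emb)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(emb + d), dim=-1), p, reduction="batchmean")
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(emb)

    # Consistency loss between clean predictions and predictions under the adversarial noise.
    return F.kl_div(F.log_softmax(model(emb + r_adv), dim=-1), p, reduction="batchmean")
```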
Adversarial Examples
A small (imperceptible to humans) tweak to a neural network's input can change its output.
Creating an adversarial example:
- Compute the gradient of the loss with respect to the input
- Add epsilon times the gradient to the input
- Possibly repeat multiple times
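A single-step sketch of that recipe in PyTorch (`model`, `epsilon`, and `steps` are illustrative; the common FGSM variant uses the sign of the gradient rather than the raw gradient):

```python
import torch
import torch.nn.functional as F

def adversarial_example(model, x, y, epsilon=0.01, steps=1):
    """Perturb input `x` so the model's loss on label `y` increases."""
    x_adv = x.clone().detach()
    for _ in range(steps):                           # "possibly repeat multiple times"
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)     # gradient of the loss w.r.t. the input
        # FGSM-style step; using epsilon * grad itself (as described above) also works.
        x_adv = (x_adv + epsilon * grad.sign()).detach()
    return x_adv
```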