# Semi-Supervised Learning for NLP

CS224n lecture 17.

## Why use semi-supervised learning for NLP?

Why has deep learning been so successful recently?

• Better “tricks” (dropout, improved optimizers (e.g., Adam), batch norm, attention)
• Better hardware (thanks video games!) -> larger models
• Larger datasets

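Semi-supervised learning helps with the last point by enlarging the training data:

• Use unlabeled examples during training
• Unlabeled text is easy to find for NLP!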

## Semi-supervised learning algorithms

We will cover three semi-supervised learning techniques:

• Pre-training
  • One of the tricks that started to make NNs successful
• Self-training
  • One of the oldest and simplest semi-supervised learning algorithms (1960s)
• Consistency regularization
  • Recent idea (2014, lots of active research)
  • Great results for both computer vision and NLP

### Pre-training

1. First train an unsupervised model on unlabeled data
2. Then incorporate the model’s learned weights into a supervised model and train it on the labeled data
• Optional: continue fine-tuning the unsupervised weights.

Pre-training Example: Word2Vec

• Shared part is word embeddings
• No unsupervised-only part
• Supervised-only part is the rest of the model
• Unsupervised learning: skip-gram, CBOW, GloVe, etc.
• Supervised learning: training on some NLP task
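
The split above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-dimensional vectors are made-up stand-ins for real skip-gram/CBOW/GloVe vectors, and `build_embedding_table` and `sentence_features` are hypothetical helper names, not part of any library.

```python
import random

def build_embedding_table(vocab, pretrained, dim, seed=0):
    """Shared part: reuse pre-trained vectors, randomly initialize the rest."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            # Copy, so later fine-tuning does not modify the pre-trained source.
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

def sentence_features(tokens, table, dim):
    """Average the embeddings of known tokens.

    The result would be fed to the supervised-only part of the model
    (not shown here).
    """
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up stand-ins for vectors learned on unlabeled text.
pretrained = {"good": [0.9, 0.1], "bad": [-0.8, 0.2]}
vocab = ["good", "bad", "movie"]

table = build_embedding_table(vocab, pretrained, dim=2)
features = sentence_features(["good", "movie"], table, dim=2)
```

Whether the copied vectors are then updated during supervised training corresponds to the optional fine-tuning step.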

_Why does pre-training work?_

• “Smart” initialization for the model
• More meaningful representations in the model
  • e.g., GloVe vectors capture a lot about word meaning, so our model no longer has to learn the meanings itself

Pre-Training Strategies: Auto-Encoder (Dai & Le, 2015)

• For pre-training, train an autoencoder:
  • seq2seq model (without attention) where the target sequence is the input sequence
  • the encoder converts the input into a vector that contains enough information that the input can be recovered
• Initialize the LSTM for a sentence classification model with the encoder
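
As a rough formalization (the notation here is an assumption, not taken from the paper): for an input sequence $x = (x_1, \dots, x_T)$ and an encoder that produces a single vector $\mathrm{enc}(x)$, the autoencoder is trained to minimize the reconstruction loss

$$
\mathcal{L}_{\mathrm{AE}}(x) = -\sum_{t=1}^{T} \log p\left(x_t \mid x_{<t},\ \mathrm{enc}(x)\right)
$$

Because the decoder sees the input only through $\mathrm{enc}(x)$ (there is no attention), the encoder is pushed to pack the whole sequence into that vector.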

• CoVe (McCann et al., 2017): “Learned in Translation: Contextualized Word Vectors”
• ELMo (Peters et al., 2017): “Semi-supervised sequence tagging with bidirectional language models”

### Self-training

• Use unlabeled data without a giant model or long pre-training phase
• Old (1960s) and simple semi-supervised algorithm
• Algorithm:
1. Train the model on the labeled data.
2. Have the model label the unlabeled data.
   • Take some of the examples the model is most confident about (i.e., the model gives them high probability). Add those examples, with the model’s labels, to the training set.
3. Go back to step 1.
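
A minimal sketch of this loop, under assumptions that are not from the lecture: the model is a toy nearest-centroid classifier on 1-D inputs, "most confident" is implemented as a fixed probability threshold, and the loop stops when no unlabeled example clears that threshold.

```python
import math

def fit_centroids(examples):
    """Toy 'model': one centroid per class, from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each centroid."""
    scores = {y: -(x - c) ** 2 for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    train, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit_centroids(train)           # 1. train on labeled data
        confident, rest = [], []
        for x in pool:                         # 2. label the unlabeled data
            probs = predict_proba(model, x)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:      # keep only confident predictions
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:                      # nothing new to add: stop
            break
        train.extend(confident)                # add them with the model's labels
        pool = rest                            # 3. go back to step 1
    return fit_centroids(train), train

labeled = [(0.0, "neg"), (4.0, "pos")]
unlabeled = [0.5, 1.0, 3.0, 3.5, 2.1]
model, train = self_train(labeled, unlabeled)
```

In this toy run the ambiguous point `2.1` never clears the threshold, so it is never added to the training set.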

#### Online Self-Training

Note: convert the model-produced label distribution to a one-hot vector, e.g., [0.1, 0.5, 0.4] => [0, 1, 0]
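
The conversion can be sketched as follows; `to_one_hot` and `cross_entropy` are hypothetical helper names, and the sketch assumes the hard label is then used as the training target for that same unlabeled example.

```python
import math

def to_one_hot(probs):
    """Replace a predicted distribution with a hard (one-hot) label."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and the model's prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

probs = [0.1, 0.5, 0.4]                       # model output on an unlabeled example
pseudo_label = to_one_hot(probs)              # hard label used as the target
loss = cross_entropy(pseudo_label, probs)     # equals -log(0.5) here
```

One reason the hard label matters: with softmax outputs, the cross-entropy gradient with respect to the logits is prediction minus target, so using the unmodified prediction as its own target would give no training signal.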

### Consistency regularization

• Train the model so a bit of noise doesn’t mess up its predictions
• Equivalently, the model must give consistent predictions to nearby data points
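
One common way to turn this into a loss is to penalize the divergence between the prediction on an input and the prediction on a noisy copy of it. The sketch below assumes a toy linear softmax classifier, Gaussian input noise, and KL divergence as the penalty; all three are illustrative choices rather than the lecture's specification.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Toy linear classifier: one weight row per class."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def kl_divergence(p, q):
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def consistency_loss(weights, x, noise_scale=0.1, seed=0):
    """KL between the clean prediction and the prediction under input noise."""
    rng = random.Random(seed)
    noisy_x = [v + rng.gauss(0.0, noise_scale) for v in x]
    clean_pred = predict(weights, x)        # treated as a fixed target
    noisy_pred = predict(weights, noisy_x)
    return kl_divergence(clean_pred, noisy_pred)

weights = [[1.0, -1.0], [-1.0, 1.0]]
x = [0.5, 0.2]                              # an unlabeled example
loss = consistency_loss(weights, x)
```

The loss never looks at a gold label, which is why it can be computed on unlabeled examples.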

Amazing Results for Computer Vision!

#### How to Apply Consistency Regularization to NLP?

##### Virtual Adversarial Training (Miyato et al., 2017)
• Apply consistency regularization to text classification
• First embed the words
• Add the noise to the word embeddings
• The word embeddings have to be constrained (e.g., make them have zero mean and unit variance)
  • Otherwise the model could just give them a really large magnitude so the noise doesn’t change anything
• Noise added to the word embeddings is not chosen randomly: it is chosen adversarially
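
The two ingredients above, constrained embeddings and adversarially chosen noise, can be sketched on a toy model. Everything here is an assumption made for illustration: the classifier is a linear softmax over a single embedding vector, the normalization is unweighted (the paper's exact scheme may differ), and the gradient is approximated with finite differences instead of backpropagation so the example runs without a deep learning library.

```python
import math
import random

def normalize_embeddings(table):
    """Give each embedding dimension zero mean and unit variance over the vocabulary."""
    words = list(table)
    dim = len(table[words[0]])
    out = {w: [0.0] * dim for w in words}
    for j in range(dim):
        column = [table[w][j] for w in words]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column)) or 1.0
        for w in words:
            out[w][j] = (table[w][j] - mean) / std
    return out

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def kl_divergence(p, q):
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def adversarial_noise(weights, x, epsilon=0.5, xi=0.1, h=1e-3, start=None, seed=0):
    """Noise of norm epsilon pointing where the prediction changes fastest."""
    rng = random.Random(seed)
    d = list(start) if start is not None else [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(v * v for v in d))
    d = [v / norm for v in d]               # unit-length starting direction
    clean = predict(weights, x)

    def divergence(direction):
        nudged = [xj + xi * dj for xj, dj in zip(x, direction)]
        return kl_divergence(clean, predict(weights, nudged))

    # Central finite differences stand in for the backpropagated gradient.
    grad = []
    for j in range(len(d)):
        up, down = list(d), list(d)
        up[j] += h
        down[j] -= h
        grad.append((divergence(up) - divergence(down)) / (2 * h))
    grad_norm = math.sqrt(sum(g * g for g in grad))
    if grad_norm == 0.0:
        return [0.0] * len(x)
    return [epsilon * g / grad_norm for g in grad]

table = normalize_embeddings({"good": [1.0, 2.0], "bad": [3.0, 6.0]})
weights = [[1.0, -1.0], [-1.0, 1.0]]
x = table["good"]
r_adv = adversarial_noise(weights, x)
perturbed = [xj + rj for xj, rj in zip(x, r_adv)]
vat_loss = kl_divergence(predict(weights, x), predict(weights, perturbed))
```

Because the noise has a fixed norm `epsilon`, the normalization step matters: without it, the model could scale the embeddings up until a perturbation of that size is negligible.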