Semi-Supervised Learning for NLP. CS224n lecture 17.
Why use semi-supervised learning for NLP?
Why has deep learning been so successful recently?
- Better “tricks” (dropout, improved optimizers (e.g., Adam), batch norm, attention)
- Better hardware (thanks video games!) -> larger models
- Larger datasets
NLP Dataset Sizes
Semi-Supervised Learning
- Use unlabeled examples during training
- Unlabeled examples are easy to find for NLP!
Semi-supervised learning algorithms
We will cover three semi-supervised learning techniques:
- Pre-training
- One of the tricks that started to make NNs successful
- You learned about this in week 1 (word2vec)!
- Self-training
- One of the oldest and simplest semi-supervised learning algorithms (1960s)
- Consistency regularization
- Recent idea (2014, lots of active research)
- Great results for both computer vision and NLP
Pre-training
- First train an unsupervised model on unlabeled data
- Then incorporate the model’s learned weights into a supervised model and train it on the labeled data
- Optional: continue fine-tuning the unsupervised weights.
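A minimal sketch of this recipe in PyTorch (the split into a shared pre-trained encoder and a supervised-only classifier head is illustrative, not from the lecture):

```python
import torch.nn as nn

# Shared part: an encoder pre-trained with some unsupervised objective on unlabeled text
# (the unsupervised training loop itself is not shown here).
pretrained_encoder = nn.LSTM(input_size=300, hidden_size=512, batch_first=True)

class SentenceClassifier(nn.Module):
    """Supervised model that reuses the pre-trained encoder's weights."""
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder                   # shared, initialized from pre-training
        self.out = nn.Linear(512, num_classes)   # supervised-only part, randomly initialized

    def forward(self, x):
        _, (h, _) = self.encoder(x)              # use the final hidden state
        return self.out(h[-1])

model = SentenceClassifier(pretrained_encoder, num_classes=5)

# Optional step from above: leave requires_grad=True to keep fine-tuning the pre-trained
# weights, or freeze them as shown here.
for p in model.encoder.parameters():
    p.requires_grad = False
```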
Pre-training Example: Word2Vec
- Shared part is word embeddings
- No unsupervised-only part
- Supervised-only part is the rest of the model
- Unsupervised learning: skip-gram / CBOW / GloVe / etc.
- Supervised learning: training on some NLP task
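For example, one common pattern (a sketch assuming gensim for the skip-gram step and PyTorch for the supervised model; `unlabeled_sentences` is a hypothetical list of tokenized sentences):

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Unsupervised step: train skip-gram (sg=1) embeddings on unlabeled text.
w2v = Word2Vec(sentences=unlabeled_sentences, vector_size=100, sg=1, min_count=1)

# Shared part: copy the learned vectors into the supervised model's embedding layer.
emb_matrix = torch.tensor(np.array([w2v.wv[w] for w in w2v.wv.index_to_key]),
                          dtype=torch.float)
embedding = nn.Embedding.from_pretrained(emb_matrix, freeze=False)  # freeze=True to skip fine-tuning

# Supervised-only part: the rest of the model (e.g., an LSTM + softmax for the NLP task)
# sits on top of `embedding` and is trained on the labeled data.
```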
_Why does pre-training work?_
- ”Smart” initialization for the model
- More meaningful representations in the model
- e.g., GloVe vectors already capture a lot about word meaning, so our model no longer has to learn word meanings from scratch
Pre-Training for NLP
Pre-Training Strategies: Auto-Encoder (Dai & Le, 2015)
- For pre-training, train an autoencoder:
- seq2seq model (without attention) where the target sequence is the input sequence
- the encoder converts the input into a vector that contains enough information that the input can be recovered
- Initialize the LSTM for a sentence classification model with the encoder
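Roughly, in code (a simplified sketch, not the paper's exact setup; sizes and names are illustrative, and a real implementation would shift the decoder input by one token):

```python
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Pre-training objective: reconstruct the input sequence from the encoder's final state."""
    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        _, state = self.encoder(self.emb(x))           # compress the input into a vector
        dec_out, _ = self.decoder(self.emb(x), state)  # target sequence = input sequence
        return self.out(dec_out)                       # per-position vocabulary logits

# After pre-training on unlabeled text, the encoder (and embedding) weights are used to
# initialize the LSTM of the sentence classification model.
```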
More:
- CoVe (McCann et al., 2017): Learned in Translation: Contextualized Word Vectors
- TagLM, the precursor to ELMo (Peters et al., 2017): Semi-supervised sequence tagging with bidirectional language models
Self-training
- Use unlabeled data without a giant model or long pretraining phase
- Old (1960s) and simple semi-supervised algorithm
- Algorithm (a code sketch follows below):
  1. Train the model on the labeled data.
  2. Have the model label the unlabeled data.
  3. Take some of the examples the model is most confident about (i.e., the examples it assigns high probability) and add them, with the model's predicted labels, to the training set.
  4. Go back to step 1.
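A minimal sketch of that loop, assuming a scikit-learn-style classifier with `fit`/`predict_proba` and NumPy arrays (the confidence threshold and number of rounds are illustrative):

```python
import numpy as np

def self_training(model, X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    """Classic (offline) self-training with a confidence threshold."""
    X, y = X_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model.fit(X, y)                            # 1. train on the (growing) labeled set
        if len(X_unlabeled) == 0:
            break
        probs = model.predict_proba(X_unlabeled)   # 2. label the unlabeled data
        keep = probs.max(axis=1) >= threshold      # 3. keep only confident predictions
        X = np.concatenate([X, X_unlabeled[keep]])
        y = np.concatenate([y, probs[keep].argmax(axis=1)])  # add them with the model's labels
        X_unlabeled = X_unlabeled[~keep]           # 4. go back to 1 with the rest
    return model
```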
Online Self-Training
Note: convert the model-produced label distribution into a one-hot vector, e.g., [0.1, 0.5, 0.4] => [0, 1, 0]
Hard vs Soft Targets
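Roughly, a hard target is the one-hot argmax of the model's prediction, while a soft target keeps the full predicted distribution. A tiny NumPy illustration:

```python
import numpy as np

pred = np.array([0.1, 0.5, 0.4])   # model's predicted distribution for an unlabeled example

hard_target = np.eye(len(pred))[pred.argmax()]   # [0., 1., 0.] -- one-hot, as in the note above
soft_target = pred                               # keep the full distribution as the target

# Soft targets preserve the model's uncertainty (class 2 is nearly as likely as class 1 here);
# hard targets discard it.
```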
Consistency regularization
- Train the model so that a small amount of noise added to an input does not change its predictions
- Equivalently, the model must give consistent predictions to nearby data points
Example:
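A minimal sketch of one consistency-regularization training step in PyTorch (the generic classifier `model`, Gaussian input noise, and the weighting hyperparameter `lam` are all assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def consistency_step(model, x_labeled, y_labeled, x_unlabeled, lam=1.0, sigma=0.1):
    """One training step: supervised loss + consistency loss on an unlabeled batch."""
    # Supervised part: standard cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    # Consistency part: predictions on clean vs. noisy versions of the same inputs should agree.
    with torch.no_grad():
        clean_probs = F.softmax(model(x_unlabeled), dim=-1)          # "target" prediction
    noisy_logits = model(x_unlabeled + sigma * torch.randn_like(x_unlabeled))
    cons_loss = F.kl_div(F.log_softmax(noisy_logits, dim=-1), clean_probs,
                         reduction="batchmean")

    return sup_loss + lam * cons_loss
```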
Amazing Results for Computer Vision!
How to Apply Consistency Regularization to NLP?
Virtual Adversarial Training (Miyato et al., 2017)
- Apply consistency regularization to text classification
- First embed the words
- Add the noise to the word embeddings
- Have to constrain the word embeddings (e.g., make them have zero mean and unit variance)
- Otherwise the model could just make them have really large magnitude so the noise doesn’t change anything
- Noise added to the word embeddings is not chosen randomly: it is chosen adversarially
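A rough sketch of the adversarial perturbation on the embeddings (simplified to a single power-iteration step; `model` here is a hypothetical function from embeddings to logits, and `xi`, `eps` are illustrative hyperparameters, not the paper's exact values):

```python
import torch
import torch.nn.functional as F

def vat_loss(model, emb, xi=1e-6, eps=2.0):
    """Virtual adversarial loss on a batch of (normalized) word embeddings `emb`."""
    with torch.no_grad():
        p = F.softmax(model(emb), dim=-1)        # predictions on the clean embeddings

    # Start from a random direction and refine it with one gradient step so it points
    # in the direction that changes the prediction the most (the adversarial direction).
    d = xi * F.normalize(torch.randn_like(emb).flatten(1), dim=1).view_as(emb)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(emb + d), dim=-1), p, reduction="batchmean")
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(emb)

    # Consistency loss between clean predictions and predictions under the adversarial noise.
    return F.kl_div(F.log_softmax(model(emb + r_adv), dim=-1), p, reduction="batchmean")
```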
Adversarial Examples
A small (imperceptible to humans) tweak to a neural network's input can change its output.
Creating an adversarial example:
- Compute the gradient of the loss with respect to the input
- Add epsilon times the gradient to the input
- Possibly repeat multiple times
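A single-step sketch of that recipe in PyTorch (`model`, `epsilon`, and `steps` are illustrative; the common FGSM variant uses the sign of the gradient rather than the raw gradient):

```python
import torch
import torch.nn.functional as F

def adversarial_example(model, x, y, epsilon=0.01, steps=1):
    """Perturb input `x` so the model's loss on label `y` increases."""
    x_adv = x.clone().detach()
    for _ in range(steps):                           # "possibly repeat multiple times"
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)     # gradient of the loss w.r.t. the input
        # FGSM-style step; using epsilon * grad itself (as described above) also works.
        x_adv = (x_adv + epsilon * grad.sign()).detach()
    return x_adv
```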