Semi-Supervised Learning for NLP

Semi-Supervised Learning for NLP. CS224n lecture 17.

Why use semi-supervised learning for NLP?

Why has deep learning been so successful recently?

  • Better “tricks” (dropout, improved optimizers (e.g., Adam), batch norm, attention)
  • Better hardware (thanks video games!) -> larger models
  • Larger datasets

NLP Dataset Sizes :

[Figure: NLP datasets]

Semi-Supervised Learning :

  • Use unlabeled examples during training
  • Unlabeled examples are easy to find for NLP!

Semi-supervised learning algorithms

We will cover three semi-supervised learning techniques :

  • Pre-training
    • One of the tricks that started to make NNs successful
    • You learned about this in week 1 (word2vec)!
  • Self-training
    • One of the oldest and simplest semi-supervised learning algorithms (1960s)
  • Consistency regularization
    • Recent idea (2014, lots of active research)
    • Great results for both computer vision and NLP


Pre-training :

  1. First train an unsupervised model on unlabeled data
  2. Then incorporate the model’s learned weights into a supervised model and train it on the labeled data
    • Optional: continue fine-tuning the pre-trained weights during supervised training.


Pre-training Example : Word2Vec

  • Shared part is word embeddings
  • No unsupervised-only part
  • Supervised-only part is the rest of the model
  • Unsupervised learning: skip-gram/CBOW/GloVe/etc.
  • Supervised learning: training on some NLP task
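
The split above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the helper names `init_embeddings` and `sentence_features` are hypothetical, the 2-d vectors are made-up numbers rather than real word2vec output, and averaging word vectors stands in for whatever supervised-only model sits on top.

```python
import random

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Build an embedding table for `vocab`, copying pre-trained vectors when available."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            # Shared part: start from the unsupervised (e.g., skip-gram/GloVe) vector.
            table[word] = list(pretrained[word])
        else:
            # Word unseen during pre-training: fall back to small random values.
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

def sentence_features(tokens, table, dim):
    """Average the word vectors; a supervised-only classifier would consume this."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 2-d "pre-trained" vectors (made-up numbers for illustration).
pretrained = {"good": [1.0, 0.0], "bad": [-1.0, 0.0]}
table = init_embeddings(["good", "bad", "movie"], pretrained, dim=2)
features = sentence_features(["good", "movie"], table, dim=2)
```

During supervised training the entries of `table` could either be frozen or updated along with the rest of the model, which corresponds to the optional fine-tuning step.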

_Why does pre-training work?_

  • “Smart” initialization for the model
  • More meaningful representations in the model
    • e.g., GloVe vectors capture a lot about word meaning, our model no longer has to learn the meanings itself

Pre-Training for NLP :


Pre-Training Strategies: Auto-Encoder (Dai & Le, 2015)

  • For pre-training, train an autoencoder:
    • seq2seq model (without attention) where the target sequence is the input sequence
    • the encoder converts the input into a vector that contains enough information that the input can be recovered
  • Initialize the LSTM for a sentence classification model with the encoder


Online Self-Training


Note: convert the model-produced label distribution to a one-hot vector, e.g., [0.1, 0.5, 0.4] => [0, 1, 0]
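
A minimal sketch of that conversion, assuming the model output is a plain list of class probabilities. The function name `to_hard_target` is hypothetical, and ties are broken in favor of the first maximal class.

```python
def to_hard_target(probs):
    """Convert a predicted distribution into a one-hot vector at the argmax."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1 if i == best else 0 for i in range(len(probs))]

hard = to_hard_target([0.1, 0.5, 0.4])
print(hard)  # [0, 1, 0]
```

The one-hot vector is then used as if it were a gold label when training on the unlabeled example.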

Hard vs Soft Targets :

Consistency regularization

  • Train the model so a bit of noise doesn’t mess up its predictions
  • Equivalently, the model must give consistent predictions to nearby data points
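
One way to write this idea down as a loss is sketched below. The assumptions are loud ones: a tiny linear softmax classifier stands in for the model, the noise is Gaussian, and KL divergence measures the disagreement. All names and numbers are illustrative, not taken from a specific paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Linear classifier: one weight row per class."""
    return softmax([sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights])

def kl_divergence(p, q):
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def consistency_loss(weights, x, noise_scale, rng):
    """KL between the prediction on x and the prediction on a noisy copy of x."""
    noisy = [x_i + rng.gauss(0.0, noise_scale) for x_i in x]
    clean_pred = predict(weights, x)   # typically treated as a fixed target
    noisy_pred = predict(weights, noisy)
    return kl_divergence(clean_pred, noisy_pred)

rng = random.Random(0)
weights = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]]  # hypothetical 3-class model
x = [1.0, 2.0]  # an unlabeled example: no label is needed for this loss
loss = consistency_loss(weights, x, noise_scale=0.1, rng=rng)
```

Because the loss compares the model with itself, it can be computed on unlabeled data and added to the usual supervised loss on labeled data.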

Example :

Amazing Results for Computer Vision!

How to Apply Consistency Regularization to NLP?

Virtual Adversarial Training (Miyato et al., 2017)

  • Apply consistency regularization to text classification
    • First embed the words
    • Add the noise to the word embeddings
    • Have to constrain the word embeddings (e.g., make them have zero mean and unit variance)
      • Otherwise the model could just make them have really large magnitude so the noise doesn’t change anything
  • Noise added to the word embeddings is not chosen randomly: it is chosen adversarially
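
The embedding constraint can be illustrated with a short sketch. It assumes the statistics are computed per dimension across the vocabulary with every word weighted equally; that choice, the function name `normalize_embeddings`, and the toy numbers are assumptions of this example, not details confirmed by the slides.

```python
import math

def normalize_embeddings(table):
    """Rescale each embedding dimension to zero mean and unit variance across the vocabulary."""
    words = list(table)
    dim = len(table[words[0]])
    normalized = {w: [0.0] * dim for w in words}
    for d in range(dim):
        column = [table[w][d] for w in words]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        std = math.sqrt(var) or 1.0  # guard against a constant dimension
        for w in words:
            normalized[w][d] = (table[w][d] - mean) / std
    return normalized

# Toy embeddings with very different scales per dimension (made-up numbers).
table = {"good": [10.0, 2.0], "bad": [-10.0, 4.0], "movie": [30.0, 6.0]}
normalized = normalize_embeddings(table)
```

After normalization the model cannot shrink the relative effect of a fixed-size perturbation just by inflating the embedding magnitudes.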

Adversarial Examples

Small (imperceptible to humans) tweaks to a neural network’s inputs can change its output.

Creating an adversarial example:

  • Compute the gradient of the loss with respect to the input
  • Add epsilon times the gradient to the input
    • Possibly repeat multiple times
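
The recipe above can be sketched on a model small enough that the input gradient has a closed form. Logistic regression is used here purely for illustration (an assumption of this example, since the slides talk about neural networks in general), and the weights and input are made-up numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x, which is (p - y) * w."""
    p = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
    return [(p - y) * w_i for w_i in w]

def adversarial_example(w, b, x, y, epsilon, steps=1):
    """Move x along the loss gradient, optionally repeating the step."""
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(w, b, x_adv, y)
        x_adv = [x_i + epsilon * g_i for x_i, g_i in zip(x_adv, grad)]
    return x_adv

w, b = [2.0, -1.0], 0.0   # hypothetical trained weights
x, y = [1.0, 0.5], 1      # an input and its true label
x_adv = adversarial_example(w, b, x, y, epsilon=0.5, steps=3)
```

For a real network the gradient would come from backpropagation rather than a formula, and a common variant steps along the sign of the gradient instead of the raw gradient.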

Word Dropout

Cross-View Consistency (Clark et al., 2018)