Generative Models: PixelRNN and PixelCNN; Variational Autoencoders (VAE); Generative Adversarial Networks (GAN)
Generative Models
Given training data, generate new samples from the same distribution: learn $p_{model}(x)$ that is similar to $p_{data}(x)$.
Several flavors:
- Explicit density estimation: explicitly define and solve for $p_{model}(x)$
- Implicit density estimation: learn model that can sample from $p_{model}(x)$ w/o explicitly defining it
Why Generative Models?
- Realistic samples for artwork, super-resolution, colorization, etc.
- Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
- Training generative models can also enable inference of latent representations that can be useful as general features
PixelRNN and PixelCNN
PixelRNN
- Generate image pixels starting from corner
- Dependency on previous pixels modeled using an RNN (LSTM)
Drawback: sequential generation is slow!
PixelCNN
- Still generate image pixels starting from corner
- Dependency on previous pixels now modeled using a CNN over context region
Training: maximize likelihood of training images
Training is faster than PixelRNN: the convolutions can be parallelized, since the context-region values are known from the training images. Generation must still proceed sequentially => still slow.
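The chain-rule factorization behind both models can be sketched in a few lines. This is a toy stand-in, not the lecture's model: each conditional $p(x_i \mid x_{<i})$ is an assumed logistic function of the previously generated pixels, playing the role of the masked convolutions / LSTM in PixelCNN / PixelRNN.

```python
import numpy as np

# Toy autoregressive model over 4 binary pixels in raster order.
# cond_prob is a made-up conditional; a real PixelCNN/PixelRNN would
# compute p(x_i | x_<i) with a neural network over the context region.
rng = np.random.default_rng(0)

def cond_prob(prev_pixels):
    # p(x_i = 1 | x_<i): more "on" context pixels -> higher probability
    return 1.0 / (1.0 + np.exp(-(np.sum(prev_pixels) - 1.0)))

def sample_image(n_pixels=4):
    pixels = []
    for _ in range(n_pixels):          # sequential generation: one pixel at a time
        p = cond_prob(pixels)
        pixels.append(int(rng.random() < p))
    return pixels

def log_likelihood(pixels):
    # log p(x) = sum_i log p(x_i | x_<i)   (chain rule factorization)
    ll = 0.0
    for i, x in enumerate(pixels):
        p = cond_prob(pixels[:i])
        ll += np.log(p if x == 1 else 1.0 - p)
    return ll

img = sample_image()
print(img, log_likelihood(img))
```

Note that computing the likelihood of a *given* image can evaluate all conditionals in parallel (fast training), while sampling must loop pixel by pixel (slow generation).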
Variational Autoencoders (VAE)
PixelCNNs define a tractable density function and optimize the likelihood of training data:
$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \ldots, x_{i-1})$$
VAEs define an intractable density function with latent z:
$$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$$
Cannot optimize this directly; derive and optimize a lower bound on the likelihood instead.
Background: Autoencoders
Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data:
How to learn this feature representation?
Train such that the features can be used to reconstruct the original data. "Autoencoding": encoding the input into itself.
L2 Loss function:
$$||x - \widetilde{x}||^2$$
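The encode -> decode -> L2 loss pipeline can be sketched with a minimal linear autoencoder. The shapes (4-d input, 2-d code) and the linear layers are illustrative assumptions; real autoencoders use deep nonlinear networks.

```python
import numpy as np

# Minimal linear autoencoder sketch: x (4-d) -> z (2-d) -> x_tilde (4-d).
# Weights are random placeholders for trained parameters.
rng = np.random.default_rng(1)
W_enc = rng.normal(size=(2, 4))   # encoder: maps input to lower-dim features
W_dec = rng.normal(size=(4, 2))   # decoder: reconstructs input from features

def reconstruct(x):
    z = W_enc @ x          # lower-dimensional feature representation z
    return W_dec @ z       # reconstruction x_tilde

def l2_loss(x):
    x_tilde = reconstruct(x)
    return np.sum((x - x_tilde) ** 2)   # ||x - x_tilde||^2

x = rng.normal(size=4)
print(l2_loss(x))
```

Training would adjust `W_enc` and `W_dec` by gradient descent on this loss; no labels are needed, which is why this is an unsupervised approach.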
Variational Autoencoders
Assume training data $\{x^{(i)}\}_{i=1}^{N}$ is generated from underlying unobserved (latent) representation z :
_Intuition (remember from autoencoders!): x is an image, z is latent factors used to generate x: attributes, orientation, etc._
We want to estimate the true parameters of this generative model. How should we represent this model?
Choose prior p(z) to be simple, e.g. Gaussian. Conditional p(x|z) is complex (generates image) => represent with neural network.
How to train the model?
Remember the strategy for training generative models from fully visible belief networks (FVBNs): learn model parameters to maximize the likelihood of the training data.
- $p_θ(z)$ : Simple Gaussian prior
- $p_θ(x|z)$ : Decoder neural network
Solution: in addition to the decoder network modeling $p_θ(x|z)$, define an additional encoder network $q_φ(z|x)$ that approximates the intractable posterior $p_θ(z|x)$. This allows us to derive a tractable lower bound on the data likelihood, which we can optimize.
Putting it all together: maximizing the likelihood lower bound
$$\log p_θ(x) \geq \mathbb{E}_{z \sim q_φ(z|x)}\big[\log p_θ(x|z)\big] - D_{KL}\big(q_φ(z|x)\,\|\,p_θ(z)\big)$$
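The two terms of the lower bound can be computed concretely for one datapoint. This sketch assumes a diagonal-Gaussian encoder output (made-up `mu`, `log_var`) and a made-up linear "decoder"; the KL term between a diagonal Gaussian and the unit-Gaussian prior has a closed form.

```python
import numpy as np

# ELBO sketch for one datapoint, assuming q(z|x) = N(mu, sigma^2 I)
# and prior p(z) = N(0, I). All numeric values are illustrative.
rng = np.random.default_rng(2)
mu = np.array([0.5, -0.3])        # encoder mean output (assumed)
log_var = np.array([-0.2, 0.1])   # encoder log-variance output (assumed)

# Closed-form KL(q(z|x) || p(z)) for diagonal Gaussian vs. N(0, I):
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
eps = rng.normal(size=2)
z = mu + np.exp(0.5 * log_var) * eps

# Stand-in decoder reconstruction term E[log p(x|z)] (up to a constant,
# a Gaussian decoder gives a negative squared error):
x = np.array([1.0, 0.0, 1.0])
x_hat = np.tanh(np.array([[0.4, -0.1], [0.2, 0.3], [-0.5, 0.6]]) @ z)
recon_log_lik = -np.sum((x - x_hat) ** 2)

elbo = recon_log_lik - kl   # maximize this lower bound on log p(x)
print(kl, elbo)
```

The reconstruction term pushes the decoder to explain the data; the KL term keeps the approximate posterior close to the prior, which is what makes sampling from the prior at generation time sensible.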
Generating Data
Use decoder network. Now sample z from prior!
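Generation needs only the decoder: sample z from the prior and decode. The linear "decoder" weights below are placeholders standing in for a trained network.

```python
import numpy as np

# Generate a new sample: z ~ p(z) = N(0, I), then decode.
# W_dec is a random placeholder for trained decoder parameters.
rng = np.random.default_rng(3)
W_dec = rng.normal(size=(4, 2))

z = rng.standard_normal(2)        # sample latent code from the prior
x_new = np.tanh(W_dec @ z)        # decoded sample (the "generated image")
print(x_new.shape)
```

Because the encoder was trained to keep $q_φ(z|x)$ close to the prior, latent codes drawn from $N(0, I)$ decode to plausible samples.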
Summary
Probabilistic spin on traditional autoencoders => allows generating data
Defines an intractable density => derive and optimize a (variational) lower bound
- Pros:
  - Principled approach to generative models
  - Allows inference of q(z|x), which can be a useful feature representation for other tasks
- Cons:
  - Maximizes a lower bound of the likelihood: okay, but not as good an evaluation as PixelRNN/PixelCNN
  - Samples blurrier and lower quality compared to state-of-the-art (GANs)
Generative Adversarial Networks (GAN)
Don’t work with any explicit density function! Instead, take game-theoretic approach: learn to generate from training distribution through 2-player game.
Generator network: try to fool the discriminator by generating real-looking images
Discriminator network: try to distinguish between real and fake images
Training GANs: Two-player game
Minimax objective function:
$$\min_{θ_g} \max_{θ_d} \Big[ \mathbb{E}_{x \sim p_{data}} \log D_{θ_d}(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D_{θ_d}(G_{θ_g}(z))\big) \Big]$$
- Discriminator ($θ_d$) wants to maximize the objective such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake)
- Generator ($θ_g$) wants to minimize the objective such that D(G(z)) is close to 1 (discriminator is fooled into thinking the generated G(z) is real)
Alternate between:
- Gradient ascent on discriminator
- Gradient descent on generator
_In practice, optimizing this generator objective does not work well: when a sample is likely fake, the gradient of $\log(1 - D(G(z)))$ is relatively flat, so learning is slowest exactly when the generator is bad._
Instead of minimizing likelihood of discriminator being correct, now maximize likelihood of discriminator being wrong.
Gradient ascent on generator, different objective:
$$\max_{θ_g} \mathbb{E}_{z \sim p(z)} \log D_{θ_d}\big(G_{θ_g}(z)\big)$$
Putting it together: GAN training algorithm
After training, use generator network to generate new images.
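The alternating-update algorithm can be sketched on a toy 1-D problem with hand-derived gradients. Everything here is an illustrative assumption, not the lecture's setup: real data is $N(2, 1)$, the generator $G(z) = θ_g + z$ shifts unit-Gaussian noise, and the discriminator is $D(x) = \sigma(ax + b)$.

```python
import numpy as np

# Toy 1-D GAN with manual gradients (all values illustrative).
rng = np.random.default_rng(4)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

a, b = 0.1, 0.0      # discriminator parameters (theta_d)
theta_g = 0.0        # generator parameter; data mean is 2.0
lr = 0.05

for step in range(500):
    x_real = rng.normal(2.0, 1.0, size=32)
    z = rng.standard_normal(32)
    x_fake = theta_g + z                      # G(z) = theta_g + z

    # Gradient ASCENT on the discriminator objective:
    #   E[log D(x_real)] + E[log(1 - D(x_fake))]
    d_real = sigmoid(a * x_real + b)
    d_fake = sigmoid(a * x_fake + b)
    a += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Gradient ASCENT on the non-saturating generator objective:
    #   E[log D(G(z))]   (maximize likelihood of discriminator being WRONG)
    d_fake = sigmoid(a * (theta_g + z) + b)
    theta_g += lr * np.mean((1 - d_fake) * a)

print(round(theta_g, 2))   # should drift toward the data mean
```

Note the structure mirrors the algorithm: each iteration does one discriminator ascent step, then one generator step on the flipped (non-saturating) objective; the generator parameter is pulled toward the real data distribution without ever seeing a density.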
Summary
Don’t work with an explicit density function.
Take game-theoretic approach: learn to generate from training distribution through 2-player game
Pros:
- Beautiful, state-of-the-art samples!
Cons:
- Trickier / more unstable to train
- Can’t solve inference queries such as p(x), p(z|x)
Active areas of research:
- Better loss functions, more stable training (Wasserstein GAN, LSGAN, many others)
- Conditional GANs, GANs for all kinds of applications
Recap
Generative Models
- PixelRNN and PixelCNN : Explicit density model, optimizes exact likelihood, good samples. But inefficient sequential generation.
- Variational Autoencoders (VAE): Optimize variational lower bound on likelihood. Useful latent representation, inference queries. But current sample quality not the best.
- Generative Adversarial Networks (GANs): Game-theoretic approach, best samples! But can be tricky and unstable to train, no inference queries.