Generative Models: PixelRNN and PixelCNN; Variational Autoencoders (VAE); Generative Adversarial Networks (GAN)

## Generative Models

Given training data, generate new samples from the same distribution: we want to learn a model $p_{model}(x)$ that is similar to the true data distribution $p_{data}(x)$.

Several flavors:

- Explicit density estimation: explicitly define and solve for $p_{model}(x)$
- Implicit density estimation: learn a model that can sample from $p_{model}(x)$ without explicitly defining it

### Why Generative Models?

- Realistic samples for artwork, super-resolution, colorization, etc.
- Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
- Training generative models can also enable inference of latent representations that can be useful as general features

## PixelRNN and PixelCNN

### PixelRNN

- Generate image pixels starting from corner
- Dependency on previous pixels modeled using an RNN (LSTM)

Drawback: sequential generation is slow!

### PixelCNN

- Still generate image pixels starting from corner
- Dependency on previous pixels now modeled using a CNN over context region

Training: maximize likelihood of training images

Training is faster than PixelRNN: convolutions can be parallelized, since the context-region values are known from the training images. Generation must still proceed sequentially => still slow.
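The context-region dependency is enforced with masked convolutions. A minimal numpy sketch of the mask construction (kernel size and the "A"/"B" naming follow the standard PixelCNN convention; the function name is illustrative):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Mask for a k x k conv kernel so each pixel only depends on
    already-generated pixels (rows above, and pixels to the left in
    its own row, in raster-scan order). Type 'A' (first layer) also
    hides the current pixel itself; type 'B' (later layers) keeps it."""
    mask = np.ones((k, k), dtype=np.float32)
    c = k // 2
    mask[c + 1:, :] = 0.0        # rows below the current pixel
    if mask_type == "A":
        mask[c, c:] = 0.0        # current pixel and pixels to its right
    else:
        mask[c, c + 1:] = 0.0    # only pixels strictly to the right
    return mask
```

The mask multiplies the convolution weights before each forward pass, so the convolution never reads "future" pixels of the raster order.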

## Variational Autoencoders (VAE)

PixelCNNs define a tractable density function and optimize the likelihood of the training data:

$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i | x_1, \dots, x_{i-1})$$

VAEs define an intractable density function with latent variable $z$:

$$p_\theta(x) = \int p_\theta(z)\, p_\theta(x|z)\, dz$$

This cannot be optimized directly; instead, derive and optimize a lower bound on the likelihood.

### Background: Autoencoders

Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data:

How to learn this feature representation?

Train such that the features can be used to reconstruct the original data. “Autoencoding” means encoding the input into features and decoding it back into itself.

L2 Loss function:

$$||x - \widetilde{x}||^2$$
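A minimal numpy sketch of this setup (the layer sizes and single-layer encoder/decoder are hypothetical, just to show the shapes and the loss):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim input x compressed to a 3-dim feature z.
W_enc = rng.normal(scale=0.1, size=(3, 8))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(8, 3))   # decoder weights

def autoencode(x):
    z = np.maximum(W_enc @ x, 0.0)   # encoder (ReLU): lower-dim features z
    x_tilde = W_dec @ z              # decoder: reconstruction x~
    return x_tilde, z

def l2_loss(x, x_tilde):
    return np.sum((x - x_tilde) ** 2)   # ||x - x~||^2

x = rng.normal(size=8)
x_tilde, z = autoencode(x)
loss = l2_loss(x, x_tilde)
```

After training, the decoder is thrown away and the encoder's features $z$ can be reused for downstream tasks.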

### Variational Autoencoders

Assume training data $\{x^{(i)}\}_{i=1}^{N}$ is generated from an underlying unobserved (latent) representation $z$:

_Intuition (remember from autoencoders!): x is an image, z is latent factors used to generate x: attributes, orientation, etc._

We want to estimate the true parameters of this generative model. How should we represent this model?

Choose prior p(z) to be simple, e.g. Gaussian. Conditional p(x|z) is complex (generates image) => represent with neural network.

How to train the model?

Remember the strategy for training generative models from fully visible belief networks (FVBNs): learn model parameters to maximize the likelihood of the training data.

- $p_θ(z)$ : Simple Gaussian prior
- $p_θ(x|z)$ : Decoder neural network

Solution: in addition to the decoder network modeling $p_θ(x|z)$, define an additional encoder network $q_φ(z|x)$ that approximates the intractable posterior $p_θ(z|x)$. This allows us to derive a **lower bound** on the data likelihood that is tractable, which we can optimize.
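Concretely, the encoder lets us lower-bound the log-likelihood (this is the standard variational bound, the ELBO):

$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right)$$

The first term is a reconstruction term (the decoder should reconstruct $x$ from sampled $z$); the KL term keeps the approximate posterior close to the prior.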

Putting it all together: maximizing the likelihood lower bound
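A minimal numpy sketch of a one-sample Monte Carlo estimate of the (negative) lower bound, assuming a Gaussian encoder output `mu`, `logvar` and an arbitrary `decode` function; the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_elbo(x, mu, logvar, decode):
    """One-sample Monte Carlo estimate of the negative lower bound."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps   # reparameterization trick
    x_tilde = decode(z)
    recon = np.sum((x - x_tilde) ** 2)    # reconstruction term
    # KL( N(mu, exp(logvar)) || N(0, I) ) in closed form:
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl

# Sanity check: with mu = 0, logvar = 0, q(z|x) equals the prior,
# so the KL term vanishes.
x = np.zeros(4)
loss = neg_elbo(x, np.zeros(2), np.zeros(2), decode=lambda z: np.zeros(4))
```

Minimizing this loss jointly over the encoder and decoder parameters maximizes the lower bound on the data likelihood.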

#### Generating Data

Use decoder network. Now sample z from prior!
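A sketch of generation under the same setup (the decoder weights here are random placeholders standing in for a trained decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
W_dec = rng.normal(scale=0.1, size=(8, 3))   # hypothetical trained decoder

z = rng.standard_normal(3)   # sample z from the prior N(0, I)
x_new = W_dec @ z            # decoder maps z to a new data sample
```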

#### Summary

A probabilistic spin on traditional autoencoders => allows generating data. Defines an intractable density => derive and optimize a (variational) lower bound.

- Pros:
  - Principled approach to generative models
  - Allows inference of q(z|x), which can be a useful feature representation for other tasks
- Cons:
  - Maximizes a lower bound of the likelihood: okay, but not as good an evaluation as PixelRNN/PixelCNN
  - Samples blurrier and lower quality compared to state-of-the-art (GANs)

## Generative Adversarial Networks (GAN)

Don’t work with any explicit density function! Instead, take game-theoretic approach: learn to generate from training distribution through 2-player game.

Generator network: try to fool the discriminator by generating real-looking images

Discriminator network: try to distinguish between real and fake images

### Training GANs: Two-player game

- Discriminator $θ_d$ wants to maximize objective such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake)
- Generator $θ_g$ wants to minimize objective such that D(G(z)) is close to 1 (discriminator is fooled into thinking generated G(z) is real)
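The two bullets correspond to the standard minimax objective:

$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{data}}\left[\log D_{\theta_d}(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D_{\theta_d}(G_{\theta_g}(z))\right)\right]$$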

Alternate between:

- Gradient ascent on discriminator
- Gradient descent on generator

_In practice, optimizing this generator objective does not work well!_

Instead of minimizing likelihood of discriminator being correct, now maximize likelihood of discriminator being wrong.

Gradient ascent on generator, different objective:
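$$\max_{\theta_g} \; \mathbb{E}_{z \sim p(z)}\left[\log D_{\theta_d}(G_{\theta_g}(z))\right]$$

This has the same fixed point as the original objective but gives much stronger gradients early in training, when generated samples are easy for the discriminator to reject.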

Putting it together: GAN training algorithm
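A toy numpy sketch of the alternating updates on a 1-D problem. The scalar-parameter generator and sigmoid discriminator are illustrative stand-ins for neural networks, and finite-difference gradients replace backprop to keep the sketch dependency-free:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def D(x, theta_d):            # discriminator: P(x is real)
    return sigmoid(theta_d * x)

def G(z, theta_g):            # generator: noise -> "sample"
    return theta_g * z

def d_objective(theta_d, theta_g, x_real, z):
    # Discriminator does gradient ASCENT on this.
    return np.mean(np.log(D(x_real, theta_d)) +
                   np.log(1.0 - D(G(z, theta_g), theta_d)))

def g_objective(theta_d, theta_g, z):
    # Non-saturating objective: generator ASCENDS log D(G(z)).
    return np.mean(np.log(D(G(z, theta_g), theta_d)))

theta_d, theta_g, lr, h = 0.5, 0.1, 0.02, 1e-5
for step in range(50):
    x_real = rng.normal(loc=2.0, size=32)     # minibatch of "real" data
    z = rng.standard_normal(32)               # minibatch of prior noise
    # Alternate: one ascent step on the discriminator...
    gd = (d_objective(theta_d + h, theta_g, x_real, z) -
          d_objective(theta_d - h, theta_g, x_real, z)) / (2 * h)
    theta_d += lr * gd
    # ...then one ascent step on the generator.
    gg = (g_objective(theta_d, theta_g + h, z) -
          g_objective(theta_d, theta_g - h, z)) / (2 * h)
    theta_g += lr * gg
```

Even in this toy setting the two objectives pull against each other, which is the source of the training instability noted below.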

After training, use generator network to generate new images.

### Summary

Don’t work with an explicit density function.

Take game-theoretic approach: learn to generate from training distribution through 2-player game

Pros:

- Beautiful, state-of-the-art samples!

Cons:

- Trickier / more unstable to train
- Can’t solve inference queries such as p(x), p(z|x)

Active areas of research:

- Better loss functions, more stable training (Wasserstein GAN, LSGAN, many others)
- Conditional GANs, GANs for all kinds of applications

## Recap

Generative Models

- PixelRNN and PixelCNN: explicit density model, optimizes exact likelihood, good samples. But inefficient sequential generation.
- Variational Autoencoders (VAE): optimize a variational lower bound on the likelihood. Useful latent representation, allows inference queries. But current sample quality not the best.
- Generative Adversarial Networks (GANs): game-theoretic approach, best samples! But can be tricky and unstable to train, no inference queries.