How to structure your model in TensorFlow

## Overall structure of a model in TensorFlow

#### Assemble graph

- Define placeholders for input and output
- Define the weights
- Define the inference model
- Define loss function
- Define optimizer

#### Compute

## Word2Vec

Skip-gram vs CBOW (Continuous Bag-of-Words)

Algorithmically, these models are similar, except that CBOW predicts the center word from its context words, while skip-gram does the inverse and predicts the context words from the center word. For example, given the sentence “The quick brown fox jumps”, CBOW tries to predict “brown” from “the”, “quick”, “fox”, and “jumps”, while skip-gram tries to predict “the”, “quick”, “fox”, and “jumps” from “brown”.

Statistically, CBOW smooths over a lot of the distributional information by treating an entire context as one observation, which tends to help on smaller datasets. Skip-gram, however, treats each context-target pair as a new observation, which tends to do better on larger datasets.
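To make the difference in "observations" concrete, here is a minimal pure-Python sketch that builds training examples both ways from the example sentence. The helper names (`context_window`, `skipgram_pairs`, `cbow_examples`) and the window size of 2 are assumptions made for illustration, not part of any library.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) observation per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) observation per position."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


tokens = "the quick brown fox jumps".split()

# CBOW sees the whole context of "brown" as a single observation.
print(cbow_examples(tokens)[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')

# Skip-gram turns the same window into four separate observations.
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "brown"])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

On this five-word sentence, skip-gram produces 14 observations where CBOW produces 5, which is one way to see why skip-gram benefits more from larger datasets.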

Vector representations of words projected on a 3D space:

#### Word2Vec Tutorial - The Skip-Gram Model

_More details in The Skip-Gram Model_

## How to structure your TensorFlow model

#### Phase 1: assemble your graph

- Define placeholders for input and output
- Define the weights
- Define the inference model
- Define loss function
- Define optimizer

#### Phase 2: execute the computation

This phase is basically training your model. There are a few steps:

- Initialize all model variables for the first time.
- Feed in the training data. This might involve randomizing the order of data samples.
- Execute the inference model on the training data, so that it calculates the output for each training input example with the current model parameters.
- Compute the cost.
- Adjust the model parameters to minimize/maximize the cost, depending on the model.

Let’s apply these steps to creating our word2vec, skip-gram model.

### Phase 1: Assemble the graph

#### 1. Define placeholders for input and output

The input is the center word and the output is the target (context) word. Instead of using one-hot vectors, we input the indices of those words directly. For example, if the center word is the 1001st word in the vocabulary, we input the number 1001.

Each sample input is a scalar, so the placeholder for BATCH_SIZE sample inputs will have shape [BATCH_SIZE].

Similarly, the placeholder for BATCH_SIZE sample outputs will have shape [BATCH_SIZE]. During training, the model output is conceptually a one-hot vector that marks a context word in the same window as the input word, whereas at prediction time the output is a probability distribution. Here we simplify the output to the index of the context word.

    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])

_Note that each center word and each target word being fed in is a scalar: we feed in its index in our vocabulary._
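As a small sketch of this step, the snippet below declares both placeholders and feeds a batch of word indices through one of them. It assumes the TF 1.x API is reached through `tensorflow.compat.v1`, and the toy `BATCH_SIZE` and the `target_words` declaration are assumptions, since only `center_words` survives in the snippet above.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

BATCH_SIZE = 4  # toy value chosen for illustration

with tf.Graph().as_default():
    # One vocabulary index per sample, so each placeholder is a vector of scalars.
    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name="center_words")
    target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name="target_words")

    with tf.Session() as sess:
        # Feeding indices directly; no one-hot vectors are ever built.
        fed = sess.run(center_words, feed_dict={center_words: [1001, 7, 42, 3]})

print(fed)
```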

#### 2. Define the weight (in this case, embedding matrix)

Each row corresponds to the representation vector of one word. If one word is represented with a vector of size EMBED_SIZE, then the embedding matrix will have shape [VOCAB_SIZE, EMBED_SIZE]. We initialize the embedding matrix with values drawn from a random distribution. In this case, let’s choose a uniform distribution.

    embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))

#### 3. Inference (compute the forward path of the graph)

Our goal is to get the vector representations of words in our dictionary. Remember that the embed_matrix has dimension VOCAB_SIZE x EMBED_SIZE, with each row of the embedding matrix corresponding to the vector representation of the word at that index. So to get the representations of all the center words in the batch, we take the slice of all corresponding rows in the embedding matrix. TensorFlow provides a convenient method for this, called tf.nn.embedding_lookup().

    tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, ...)

This method is really useful when it comes to matrix multiplication with one-hot vectors, because it saves us from doing a bunch of unnecessary computation that would return 0 anyway. An illustration from Chris McCormick shows the multiplication of a one-hot vector with a matrix.

So, to get the embedding (or vector representation) of the input center words, we use this:

    embed = tf.nn.embedding_lookup(embed_matrix, center_words)

_More details about tf.nn.embedding_lookup in https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup_
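The equivalence between a one-hot multiplication and a row lookup can be checked numerically. The following is a NumPy sketch of the idea rather than TensorFlow's implementation; the toy sizes and the random seed are assumptions.

```python
import numpy as np

VOCAB_SIZE, EMBED_SIZE = 5, 3  # toy sizes chosen for illustration
rng = np.random.default_rng(0)
embed_matrix = rng.uniform(-1.0, 1.0, size=(VOCAB_SIZE, EMBED_SIZE))

center_words = np.array([3, 0, 3])  # a batch of word indices

# One-hot route: build a [batch, VOCAB_SIZE] matrix and multiply.
# Almost every product in this matmul is a multiplication by zero.
one_hot = np.eye(VOCAB_SIZE)[center_words]
via_matmul = one_hot @ embed_matrix

# Lookup route: select the rows directly, which is what an embedding lookup does.
via_lookup = embed_matrix[center_words]

print(np.allclose(via_matmul, via_lookup))  # True
```

The lookup touches only `len(center_words) * EMBED_SIZE` values, while the matmul route does work proportional to `VOCAB_SIZE` for every sample.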

#### 4. Define the loss function

While NCE (noise-contrastive estimation) is cumbersome to implement in pure Python, TensorFlow already implements it for us.

    tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, ...)

**Note that the argument order matters: in the signature above, the third argument is labels and the fourth is inputs. TensorFlow releases before 1.0 took inputs before labels, so passing both by keyword avoids mixing them up.**

For nce_loss, we need weights and biases for the hidden layer to calculate the NCE loss.

    nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], ...))

#### 5. Define optimizer

    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

### Phase 2: Execute the computation

We create a session and then, within the session, use the good old feed_dict to feed inputs and outputs into the placeholders, run the optimizer to minimize the loss, and fetch the loss value to report back to us.

    with tf.Session() as sess:
        ...
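Putting both phases together, a minimal end-to-end sketch might look like the following. It assumes the TF 1.x API via `tensorflow.compat.v1`; the hyperparameter values, the `nce_bias` variable, the `stddev` of the weight initializer, and the random toy batches (standing in for a real batch generator) are all assumptions made for illustration. The labels placeholder has shape [BATCH_SIZE, 1] because `tf.nn.nce_loss` expects labels of shape [batch_size, num_true].

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

# Toy hyperparameters, chosen only for illustration.
VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE = 50, 8, 16
NUM_SAMPLED, LEARNING_RATE, NUM_STEPS = 5, 1.0, 20

graph = tf.Graph()
with graph.as_default():
    # Phase 1: assemble the graph.
    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name="center_words")
    target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name="target_words")

    embed_matrix = tf.Variable(
        tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name="embed_matrix")
    embed = tf.nn.embedding_lookup(embed_matrix, center_words, name="embed")

    nce_weight = tf.Variable(
        tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], stddev=1.0 / EMBED_SIZE ** 0.5),
        name="nce_weight")
    nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name="nce_bias")

    # Keyword arguments sidestep any confusion about the labels/inputs order.
    loss = tf.reduce_mean(tf.nn.nce_loss(
        weights=nce_weight, biases=nce_bias,
        labels=target_words, inputs=embed,
        num_sampled=NUM_SAMPLED, num_classes=VOCAB_SIZE), name="loss")

    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
    init = tf.global_variables_initializer()

# Phase 2: execute the computation.
rng = np.random.RandomState(0)
losses = []
with tf.Session(graph=graph) as sess:
    sess.run(init)
    for step in range(NUM_STEPS):
        # Random indices stand in for real (center, context) pairs.
        centers = rng.randint(0, VOCAB_SIZE, size=BATCH_SIZE).astype(np.int32)
        targets = rng.randint(0, VOCAB_SIZE, size=(BATCH_SIZE, 1)).astype(np.int32)
        batch_loss, _ = sess.run(
            [loss, optimizer],
            feed_dict={center_words: centers, target_words: targets})
        losses.append(float(batch_loss))
    final_embeddings = sess.run(embed_matrix)

print("average loss over %d steps: %.3f" % (NUM_STEPS, sum(losses) / len(losses)))
```

Because the batches here are random, the loss is not expected to fall in any meaningful way; with real skip-gram pairs the same loop would be the training procedure described above.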

#### Name Scope

Let’s give the tensors names and see what our model looks like in TensorBoard.

This doesn’t look very readable: as you can see in the graph, the nodes are scattered all over. TensorBoard doesn’t know which nodes are similar to which and should be grouped together. This setback can become extremely daunting when you build complex models with hundreds of ops.

So how can we tell TensorBoard which nodes should be grouped together? For example, we would like to group all ops related to input/output together, and all ops related to NCE loss together. Thankfully, TensorFlow lets us do that with name scopes. You can just put all the ops that you want to group together under the block:

    with tf.name_scope(name_of_that_scope):
        ...

When you visualize that in TensorBoard, you will see your nodes grouped into neat blocks.

You can click on the plus sign on top of each name scope block to see all the ops inside that block.
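A short sketch of what name scopes do to the graph: every op created inside a scope gets the scope name as a prefix, and that prefix is what TensorBoard uses for grouping. The scope names `"data"` and `"embed"` and the toy sizes are assumptions, and the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

BATCH_SIZE, VOCAB_SIZE, EMBED_SIZE = 4, 10, 3  # toy sizes chosen for illustration

graph = tf.Graph()
with graph.as_default():
    with tf.name_scope("data"):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name="center_words")
        target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name="target_words")

    with tf.name_scope("embed"):
        embed_matrix = tf.Variable(
            tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name="embed_matrix")
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name="embed")

# Tensor names carry the scope they were created in.
print(center_words.name)  # data/center_words:0
print(embed.name.split("/")[0])  # embed
```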

Question: how do we make our model easiest to reuse?

Hint: take advantage of Python’s object-oriented-ness.

Answer: build our model as a class!
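One possible shape for such a class is sketched below, with one method per graph-assembly step. The class name `SkipGramModel`, its method names, and the constructor arguments are hypothetical, and the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module


class SkipGramModel:
    """Hypothetical wrapper that assembles the skip-gram graph step by step."""

    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.learning_rate = learning_rate

    def _create_placeholders(self):
        with tf.name_scope("data"):
            self.center_words = tf.placeholder(
                tf.int32, shape=[self.batch_size], name="center_words")
            self.target_words = tf.placeholder(
                tf.int32, shape=[self.batch_size, 1], name="target_words")

    def _create_embedding(self):
        with tf.name_scope("embed"):
            self.embed_matrix = tf.Variable(
                tf.random_uniform([self.vocab_size, self.embed_size], -1.0, 1.0),
                name="embed_matrix")

    def _create_loss(self):
        with tf.name_scope("loss"):
            embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words)
            nce_weight = tf.Variable(
                tf.truncated_normal([self.vocab_size, self.embed_size],
                                    stddev=1.0 / self.embed_size ** 0.5),
                name="nce_weight")
            nce_bias = tf.Variable(tf.zeros([self.vocab_size]), name="nce_bias")
            self.loss = tf.reduce_mean(tf.nn.nce_loss(
                weights=nce_weight, biases=nce_bias,
                labels=self.target_words, inputs=embed,
                num_sampled=self.num_sampled, num_classes=self.vocab_size))

    def _create_optimizer(self):
        self.optimizer = tf.train.GradientDescentOptimizer(
            self.learning_rate).minimize(self.loss)

    def build_graph(self):
        """Run the assembly steps in the order used in Phase 1."""
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()


graph = tf.Graph()
with graph.as_default():
    model = SkipGramModel(vocab_size=50, embed_size=8, batch_size=16,
                          num_sampled=5, learning_rate=1.0)
    model.build_graph()
```

With this layout, a second model with different sizes is just another constructor call inside its own graph.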

## More about TensorBoard

    with tf.name_scope("summaries"):
        ...

Display:

_More details in http://blog.csdn.net/sinat_33761963/article/details/62433234_
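As a rough sketch of how summary ops are typically wired up for TensorBoard, the example below records a scalar and a histogram, merges them, and writes them with a `FileWriter`. The tensors being summarized, the summary names, and the temporary log directory are assumptions made for illustration; the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

logdir = tempfile.mkdtemp()  # hypothetical log directory

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None], name="x")
    loss = tf.reduce_mean(tf.square(x), name="loss")

    with tf.name_scope("summaries"):
        tf.summary.scalar("loss", loss)
        tf.summary.histogram("histogram_x", x)
        # Merge every summary defined so far into a single op.
        summary_op = tf.summary.merge_all()

    writer = tf.summary.FileWriter(logdir, graph)
    with tf.Session() as sess:
        for step in range(3):
            summary = sess.run(summary_op, feed_dict={x: [float(step), 1.0]})
            writer.add_summary(summary, global_step=step)
    writer.close()

event_files = os.listdir(logdir)
print(event_files)
```

Pointing TensorBoard at `logdir` (for example `tensorboard --logdir <that directory>`) should then show the recorded values.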

## Saving and Restoring

The tf.train.Saver class provides methods for saving and restoring models. The tf.train.Saver constructor adds save and restore ops to the graph for all, or a specified list, of the variables in the graph. The Saver object provides methods to run these ops, specifying paths for the checkpoint files to write to or read from.

The saver will restore all variables already defined in your model.

TensorFlow saves variables in binary checkpoint files that, roughly speaking, map variable names to tensor values.

#### Saving variables

Create a Saver with tf.train.Saver() to manage all variables in the model. For example, the following snippet demonstrates how to call the tf.train.Saver.save method to save variables to a checkpoint file:

    # Create some variables.
    ...

You number checkpoint filenames by passing a value to the optional global_step argument to save():

    saver.save(sess, 'my-model', global_step=0)  # ==> filename: 'my-model-0'

Additionally, optional arguments to the Saver() constructor let you control the proliferation of checkpoint files on disk:

- max_to_keep: the maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept).
- keep_checkpoint_every_n_hours: in addition to keeping the most recent max_to_keep checkpoint files, you might want to keep one checkpoint file for every N hours of training. This can be useful if you want to later analyze how a model progressed during a long training session. For example, passing keep_checkpoint_every_n_hours=2 ensures that you keep one checkpoint file for every 2 hours of training. The default value of 10,000 hours effectively disables the feature.

Note that you still have to call the save() method to save the model. Passing these arguments to the constructor will not save variables automatically for you.
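The interaction between `global_step` and `max_to_keep` can be seen in a small sketch. The variable, the temporary checkpoint directory, and the choice of `max_to_keep=2` are assumptions; the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

ckpt_dir = tempfile.mkdtemp()  # hypothetical checkpoint directory
ckpt_prefix = os.path.join(ckpt_dir, "my-model")

with tf.Graph().as_default():
    v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer())
    inc_v1 = v1.assign(v1 + 1)

    # Keep only the two most recent checkpoints on disk.
    saver = tf.train.Saver(max_to_keep=2)

    saved_paths = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(4):
            sess.run(inc_v1)
            # Nothing is written unless save() is called explicitly.
            saved_paths.append(saver.save(sess, ckpt_prefix, global_step=step))

    kept = [os.path.basename(path) for path in saver.last_checkpoints]

print(os.path.basename(saved_paths[0]))  # my-model-0
print(kept)  # ['my-model-2', 'my-model-3']
```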

#### Restoring variables

The tf.train.Saver object not only saves variables to checkpoint files, it also restores variables. Note that when you restore variables from a file you do not have to initialize them beforehand. For example, the following snippet demonstrates how to call the tf.train.Saver.restore method to restore variables from a checkpoint file:

    tf.reset_default_graph()
    ...
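A complete save-then-restore round trip might look like the sketch below. It uses two separate `tf.Graph()` objects instead of `tf.reset_default_graph()` so that the example stays self-contained; the variable names, the helper `build_variables`, and the temporary checkpoint path are assumptions, and the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

ckpt_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")  # hypothetical path


def build_variables():
    """Hypothetical helper: declare the same two variables in whichever graph is active."""
    v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer())
    v2 = tf.get_variable("v2", shape=[5], initializer=tf.zeros_initializer())
    return v1, v2


# First graph: initialize, modify, and save the variables.
with tf.Graph().as_default():
    v1, v2 = build_variables()
    inc_v1 = v1.assign(v1 + 1)
    dec_v2 = v2.assign(v2 - 1)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run([inc_v1, dec_v2])
        saver.save(sess, ckpt_path)

# Second graph: restore without running any initializer first.
with tf.Graph().as_default():
    v1, v2 = build_variables()
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, ckpt_path)
        restored_v1, restored_v2 = sess.run([v1, v2])

print(restored_v1)  # [1. 1. 1.]
print(restored_v2)  # [-1. -1. -1. -1. -1.]
```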

#### Choosing which variables to save and restore

If you do not pass any arguments to tf.train.Saver(), the saver handles all variables in the graph. Each variable is saved under the name that was passed when the variable was created.

It is sometimes useful to explicitly specify names for variables in the checkpoint files. For example, you may have trained a model with a variable named “weights” whose value you want to restore into a variable named “params”.

It is also sometimes useful to only save or restore a subset of the variables used by a model. For example, you may have trained a neural net with five layers, and you now want to train a new model with six layers that reuses the existing weights of the five trained layers. You can use the saver to restore the weights of just the first five layers.

You can easily specify the names and variables to save or load by passing to the tf.train.Saver() constructor either of the following:

- A list of variables (which will be stored under their own names).
- A Python dictionary in which keys are the names to use and the values are the variables to manage.
For example:

    import tensorflow as tf

    tf.reset_default_graph()

    # Create some variables.
    v1 = tf.get_variable("v1", [3], initializer=tf.zeros_initializer)
    v2 = tf.get_variable("v2", [5], initializer=tf.zeros_initializer)

    # Add ops to save and restore only `v2` using the name "v2".
    saver = tf.train.Saver({"v2": v2})

    # Use the saver object normally after that.
    with tf.Session() as sess:
        # Initialize v1 since the saver will not.
        v1.initializer.run()
        saver.restore(sess, "/tmp/model.ckpt")

        print("v1 : %s" % v1.eval())
        print("v2 : %s" % v2.eval())

Notes:

- You can create as many Saver objects as you want if you need to save and restore different subsets of the model variables. The same variable can be listed in multiple saver objects; its value is only changed when the Saver.restore() method is run.
- If you only restore a subset of the model variables at the start of a session, you have to run an initialize op for the other variables. See tf.variables_initializer for more information.
- To inspect the variables in a checkpoint, you can use the inspect_checkpoint library, particularly the print_tensors_in_checkpoint_file function.
- By default, Saver uses the value of the tf.Variable.name property for each variable. However, when you create a Saver object, you may optionally choose names for the variables in the checkpoint files.
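The renaming scenario described earlier, where a checkpointed variable named “weights” is restored into a variable named “params”, can be sketched as follows. The extra variable `other`, the shapes, and the temporary checkpoint path are assumptions; `tf.train.list_variables` is used here as a lightweight way to inspect the checkpoint, and the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

ckpt_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")  # hypothetical path

# "Old" model: a single variable saved under its own name, "weights".
with tf.Graph().as_default():
    weights = tf.get_variable("weights", shape=[2], initializer=tf.ones_initializer())
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, ckpt_path)

# "New" model: the variable is now called "params", plus one unrelated variable.
with tf.Graph().as_default():
    params = tf.get_variable("params", shape=[2], initializer=tf.zeros_initializer())
    other = tf.get_variable("other", shape=[1], initializer=tf.zeros_initializer())

    # Keys are names in the checkpoint; values are the variables to fill.
    saver = tf.train.Saver({"weights": params})
    with tf.Session() as sess:
        # The saver does not touch `other`, so it must be initialized separately.
        sess.run(tf.variables_initializer([other]))
        saver.restore(sess, ckpt_path)
        params_value, other_value = sess.run([params, other])

# List the (name, shape) entries stored in the checkpoint.
saved_names = [name for name, shape in tf.train.list_variables(ckpt_path)]
print(saved_names)   # ['weights']
print(params_value)  # [1. 1.]
```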

_More details in https://www.tensorflow.org/programmers_guide/saved_model_

## Why should we still learn gradients?

You’ve probably noticed that in all the models we’ve built so far, we haven’t taken a single gradient. All we need to do is build a forward pass, and TensorFlow takes care of the backward pass for us. So the question is: why should we still learn to take gradients? Why are Chris Manning and Richard Socher making us take gradients of cross entropy and softmax? Shouldn’t taking gradients by hand one day be as obsolete as taking square roots by hand has been since the invention of the calculator?

Well, maybe. For now, TensorFlow can take gradients for us, but it can’t give us intuition about what functions to use. It doesn’t tell us whether a function will suffer from exploding or vanishing gradients. We still need to know about gradients to understand why one model works while another doesn’t.

By default, the optimizer trains all the trainable variables that its objective function depends on. If there are variables that you do not want to train, you can set the keyword trainable=False when you declare them. One example of a variable you don’t want to train is global_step, a common variable you will see in many TensorFlow models that keeps track of how many training steps you’ve run.

    global_step = tf.Variable(0, trainable=False, dtype=tf.int32)
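To see `trainable=False` and `global_step` in action, here is a small sketch on a toy quadratic loss. The variable `w`, the loss, and the learning rate are assumptions for illustration; passing `global_step` to `minimize()` makes the optimizer increment it once per training step. The TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

with tf.Graph().as_default():
    w = tf.Variable(5.0, name="w")
    global_step = tf.Variable(0, trainable=False, dtype=tf.int32, name="global_step")
    loss = tf.square(w)  # toy objective with its minimum at w = 0

    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)

    # global_step is excluded from the set of variables the optimizer updates.
    trainable_names = [v.name for v in tf.trainable_variables()]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(3):
            sess.run(train_op)
        steps_taken, w_value = sess.run([global_step, w])

print(trainable_names)
print(steps_taken)  # 3
```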

You can also ask your optimizer to take gradients of specific variables. You can also modify the gradients calculated by your optimizer.

    # Create an optimizer.
    ...
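A sketch of both ideas, taking gradients only for chosen variables and modifying them before they are applied, is shown below using `compute_gradients` and `apply_gradients`. The toy variables, the loss, and the choice of clipping to [-1, 1] as the modification are assumptions; the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

with tf.Graph().as_default():
    w = tf.Variable(5.0, name="w")
    b = tf.Variable(1.0, name="b")
    loss = tf.square(w) + tf.square(b)

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

    # Take gradients with respect to `w` only; `b` is left out on purpose.
    grads_and_vars = optimizer.compute_gradients(loss, var_list=[w])

    # grads_and_vars is a list of (gradient, variable) pairs.
    # Here each gradient is clipped to [-1, 1] before being applied.
    clipped = [(tf.clip_by_value(grad, -1.0, 1.0), var) for grad, var in grads_and_vars]
    train_op = optimizer.apply_gradients(clipped)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)
        w_value, b_value = sess.run([w, b])

# d(loss)/dw = 10 is clipped to 1, so w moves from 5.0 to 5.0 - 0.1 * 1.
print(w_value, b_value)
```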

You can also prevent certain tensors from contributing to the calculation of the derivatives with respect to a specific loss with tf.stop_gradient.

    tf.stop_gradient(input, name=None)

This is very useful in situations when you want to freeze certain variables during training. Here are some examples given by TensorFlow’s official documentation.

- When you train a GAN (Generative Adversarial Network) where no backprop should happen through the adversarial example generation process.
- The EM algorithm where the M-step should not involve backpropagation through the output of the E-step.
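The effect of `tf.stop_gradient` is easiest to see on a tiny function. In this sketch (the function and the value of `x` are assumptions chosen so the derivatives are easy to check by hand), the stopped branch is treated as a constant during differentiation. The TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

with tf.Graph().as_default():
    x = tf.constant(3.0)
    a = tf.square(x)  # a = x^2

    y_full = a * x                        # x^3, so dy/dx = 3 * x^2
    y_stopped = tf.stop_gradient(a) * x   # `a` is held constant, so dy/dx = a

    grad_full = tf.gradients(y_full, x)[0]
    grad_stopped = tf.gradients(y_stopped, x)[0]

    with tf.Session() as sess:
        full_value, stopped_value = sess.run([grad_full, grad_stopped])

print(full_value)     # 27.0
print(stopped_value)  # 9.0
```

The forward values of `y_full` and `y_stopped` are identical; only the backward pass differs.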

The optimizer classes automatically compute derivatives on your graph, but you can explicitly ask TensorFlow to calculate certain gradients with tf.gradients.

    tf.gradients(ys, xs, grad_ys=None, ...)

This method constructs symbolic partial derivatives of the sum of ys with respect to each x in xs. ys and xs are each a Tensor or a list of tensors. grad_ys is a list of Tensors holding the gradients received by the ys. The list must be the same length as ys.
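A few hand-checkable calls illustrate these semantics. The functions and constants below are assumptions picked so the derivatives are easy to verify; the TF 1.x API is assumed to be available through `tensorflow.compat.v1`.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x API via the compat module

with tf.Graph().as_default():
    x = tf.constant(2.0)
    y = tf.constant(3.0)
    z = x * x * y + y  # dz/dx = 2xy, dz/dy = x^2 + 1

    # One gradient tensor is returned per entry in xs.
    dz_dx, dz_dy = tf.gradients(z, [x, y])

    # With several ys, the result is the derivative of their sum:
    # d(x^2 + 3x)/dx = 2x + 3.
    d_sum_dx = tf.gradients([x * x, 3.0 * x], x)[0]

    # grad_ys supplies the incoming gradient for each y (here a factor of 10).
    dz_dx_scaled = tf.gradients(z, x, grad_ys=[tf.constant(10.0)])[0]

    with tf.Session() as sess:
        results = sess.run([dz_dx, dz_dy, d_sum_dx, dz_dx_scaled])

print(results)  # [12.0, 5.0, 7.0, 120.0]
```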

**Technical detail**: This is especially useful when training only parts of a model. For example, we can use tf.gradients() to take the derivative G of the loss with respect to the middle layer. Then we use an optimizer to minimize the difference between the middle layer output M and M + G. This only updates the lower half of the network.