How to structure your model in TensorFlow
Overall structure of a model in TensorFlow
Assemble graph
- Define placeholders for input and output
- Define the weights
- Define the inference model
- Define loss function
- Define optimizer
Compute
Word2Vec
Skip-gram vs CBOW (Continuous Bag-of-Words)
Algorithmically, these models are similar, except that CBOW predicts center words from context words, while the skip-gram does the inverse and predicts source context words from the center words. For example, if we have the sentence “The quick brown fox jumps”, then CBOW tries to predict “brown” from “the”, “quick”, “fox”, and “jumps”, while skip-gram tries to predict “the”, “quick”, “fox”, and “jumps” from “brown”.
Statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
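To make the pairing concrete, here is a minimal sketch in plain Python (the window size of 2 is an assumption for illustration) that generates the (center, context) training pairs a skip-gram model would see for this sentence:

# Minimal sketch: generate skip-gram (center, context) pairs from the example sentence.
sentence = "the quick brown fox jumps".split()
SKIP_WINDOW = 2  # assumed context window size, purely for illustration

pairs = []
for i, center in enumerate(sentence):
    # context words are the neighbors within SKIP_WINDOW positions of the center word
    for j in range(max(0, i - SKIP_WINDOW), min(len(sentence), i + SKIP_WINDOW + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs)  # includes ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps'), ...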
Vector representations of words projected on a 3D space:
Word2Vec Tutorial - The Skip-Gram Model
_More details in The Skip-Gram Model_
How to structure your TensorFlow model
Phase 1: assemble your graph
- Define placeholders for input and output
- Define the weights
- Define the inference model
- Define loss function
- Define optimizer
Phase 2: execute the computation
This is basically training your model. There are a few steps:
- Initialize all model variables for the first time.
- Feed in the training data. Might involve randomizing the order of data samples.
- Execute the inference model on the training data, so that for each training input example it computes the output with the current model parameters.
- Compute the cost
- Adjust the model parameters to minimize/maximize the cost depending on the model.
Let’s apply these steps to creating our word2vec, skip-gram model.
Phase 1: Assemble the graph
1. Define placeholders for input and output
Input is the center word and output is the target (context) word. Instead of using one-hot vectors, we feed in the index of those words directly. For example, if the center word is the 1001st word in the vocabulary, we feed in the number 1001.
Each sample input is a scalar, so the placeholder for BATCH_SIZE sample inputs will have shape [BATCH_SIZE].
Similarly, the placeholder for BATCH_SIZE sample outputs will have shape [BATCH_SIZE]. (During training, the model's output is a one-hot vector representing a context word in the same window as the input word; at prediction time, the output is a probability distribution over the vocabulary. Here we simplify it to the index of the context word.)

center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
_Note that our center_words and target_words being fed in are both scalars – we feed in their
corresponding indices in our vocabulary._
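For illustration only, a minimal sketch of how such a word-to-index mapping might be built (the build_vocab helper and the 'UNK' token are assumptions, not part of the model above):

from collections import Counter

def build_vocab(words, vocab_size):
    # Keep the (vocab_size - 1) most frequent words; reserve index 0 for unknown words.
    dictionary = {'UNK': 0}
    for word, _ in Counter(words).most_common(vocab_size - 1):
        dictionary[word] = len(dictionary)
    return dictionary

# dictionary['fox'] would then be the integer fed into center_words / target_words.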
2. Define the weight (in this case, embedding matrix)
Each row corresponds to the representation vector of one word. If each word is represented with a vector of size EMBED_SIZE, then the embedding matrix will have shape [VOCAB_SIZE, EMBED_SIZE]. We initialize the embedding matrix to values from a random distribution. In this case, let's choose the uniform distribution.
embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))
3. Inference (compute the forward path of the graph)
Our goal is to get the vector representations of the words in our vocabulary. Remember that embed_matrix has dimension VOCAB_SIZE x EMBED_SIZE, where each row of the embedding matrix corresponds to the vector representation of the word at that index. So to get the representations of all the center words in the batch, we slice out the corresponding rows of the embedding matrix. TensorFlow provides a convenient method to do so, called tf.nn.embedding_lookup():

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None,
                       validate_indices=True, max_norm=None)
This method is really useful when it comes to matrix multiplication with one-hot vectors because
it saves us from doing a bunch of unnecessary computation that will return 0 anyway. An
illustration from Chris McCormick for multiplication of a one-hot vector with a matrix.
So, to get the embedding (or vector representation) of the input center words, we use this:
embed = tf.nn.embedding_lookup(embed_matrix, center_words)
_More details about tf.nn.embedding_lookup in https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup_
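As a sanity check, the lookup is equivalent to multiplying a one-hot matrix by the embedding matrix; this small sketch (variable names are illustrative) shows both forms side by side:

# Sketch: embedding_lookup vs. explicit one-hot multiplication.
one_hot = tf.one_hot(center_words, depth=VOCAB_SIZE)                   # [BATCH_SIZE, VOCAB_SIZE]
embed_via_matmul = tf.matmul(one_hot, embed_matrix)                    # [BATCH_SIZE, EMBED_SIZE]
embed_via_lookup = tf.nn.embedding_lookup(embed_matrix, center_words)  # same values, no wasted multiplications by zero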
4. Define the loss function
While NCE is cumbersome to implement in pure Python, TensorFlow already implemented it for us:

tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1,
               sampled_values=None, remove_accidental_hits=False, partition_strategy='mod',
               name='nce_loss')
Note that the order of the labels and inputs arguments changed between TensorFlow versions (older releases took inputs third and labels fourth), so it is safest to pass them as keyword arguments, as we do below.
For nce_loss, we need weights and biases for the hidden layer to calculate the NCE loss:

nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                              stddev=1.0 / EMBED_SIZE ** 0.5))
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]))

loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                     biases=nce_bias,
                                     labels=tf.reshape(target_words, [-1, 1]),  # nce_loss expects labels of shape [BATCH_SIZE, num_true]
                                     inputs=embed,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=VOCAB_SIZE))
5. Define optimizer
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
Phase 2: Execute the computation
We will create a session, then within the session use the good old feed_dict to feed inputs and outputs into the placeholders, run the optimizer to minimize the loss, and fetch the loss value to report back to us:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    average_loss = 0.0
    for index in range(NUM_TRAIN_STEPS):
        batch = next(batch_gen)
        loss_batch, _ = sess.run([loss, optimizer],
                                 feed_dict={center_words: batch[0], target_words: batch[1]})
        average_loss += loss_batch
        if (index + 1) % 2000 == 0:
            print('Average loss at step {}: {:5.1f}'.format(index + 1,
                                                            average_loss / (index + 1)))
Name Scope
Let’s give the tensors names and see how our model looks in TensorBoard.
This doesn’t look very readable; as you can see in the graph, the nodes are scattered all over.
TensorBoard doesn’t know which nodes are similar to which nodes and should be grouped
together. This setback can grow to be extremely daunting when you build complex models with
hundreds of ops.
So how can we tell TensorBoard which nodes should be grouped together? For example, we would like to group all ops related to input/output together, and all ops related to the NCE loss together. Thankfully, TensorFlow lets us do that with name scopes. You can just put all the ops that you want to group together under a block:

with tf.name_scope(name_of_that_scope):
    # declare op_1
    # declare op_2
    # ...
When you visualize that on TensorBoard, you will see your nodes are grouped into neat blocks:
You can click on the plus sign on top of each name scope block to see all the ops inside that
block.
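For example, the skip-gram graph above could be grouped like this (the scope names 'data', 'embed', and 'loss' are illustrative choices, not required names):

# Illustrative grouping of the word2vec ops into name scopes.
with tf.name_scope('data'):
    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
    target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='target_words')

with tf.name_scope('embed'):
    embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0),
                               name='embed_matrix')
    embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

with tf.name_scope('loss'):
    nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                                 stddev=1.0 / EMBED_SIZE ** 0.5),
                             name='nce_weight')
    nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                         biases=nce_bias,
                                         labels=tf.reshape(target_words, [-1, 1]),
                                         inputs=embed,
                                         num_sampled=NUM_SAMPLED,
                                         num_classes=VOCAB_SIZE),
                          name='loss')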
Question: how do we make our model easy to reuse?
Hint: take advantage of Python’s object-oriented-ness.
Answer: build our model as a class! ==>link
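A minimal sketch of what such a class might look like (the method names and overall structure here are one possible layout, not a prescribed interface):

# Illustrative skeleton of a class-based skip-gram model.
class SkipGramModel(object):
    """Builds the graph for the word2vec skip-gram model."""

    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.learning_rate = learning_rate

    def _create_placeholders(self):
        with tf.name_scope('data'):
            self.center_words = tf.placeholder(tf.int32, [self.batch_size], name='center_words')
            self.target_words = tf.placeholder(tf.int32, [self.batch_size], name='target_words')

    def _create_embedding(self):
        with tf.name_scope('embed'):
            self.embed_matrix = tf.Variable(
                tf.random_uniform([self.vocab_size, self.embed_size], -1.0, 1.0),
                name='embed_matrix')

    def _create_loss(self):
        with tf.name_scope('loss'):
            embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words)
            nce_weight = tf.Variable(tf.truncated_normal(
                [self.vocab_size, self.embed_size], stddev=1.0 / self.embed_size ** 0.5))
            nce_bias = tf.Variable(tf.zeros([self.vocab_size]))
            self.loss = tf.reduce_mean(tf.nn.nce_loss(
                weights=nce_weight, biases=nce_bias,
                labels=tf.reshape(self.target_words, [-1, 1]), inputs=embed,
                num_sampled=self.num_sampled, num_classes=self.vocab_size))

    def _create_optimizer(self):
        self.optimizer = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.loss)

    def build_graph(self):
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()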
More about TensorBoard

with tf.name_scope("summaries"):
    # ...
Display:
_More details in http://blog.csdn.net/sinat_33761963/article/details/62433234_
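A minimal sketch of what typically goes inside such a block, assuming the loss tensor from the skip-gram model above and an illustrative log directory './graphs':

# Sketch: summarize the loss and write it out for TensorBoard (log directory is illustrative).
with tf.name_scope('summaries'):
    tf.summary.scalar('loss', loss)
    summary_op = tf.summary.merge_all()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    batch = next(batch_gen)
    loss_value, summary = sess.run([loss, summary_op],
                                   feed_dict={center_words: batch[0], target_words: batch[1]})
    writer.add_summary(summary, global_step=0)
    writer.close()

You can then run tensorboard --logdir=./graphs and open the Scalars tab to watch the loss.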
Saving and Restoring
The tf.train.Saver class provides methods for saving and restoring models. The tf.train.Saver constructor adds save and restore ops to the graph for all, or a specified list, of the variables in the graph. The Saver object provides methods to run these ops, specifying paths for the checkpoint files to write to or read from.
The saver will restore all variables already defined in your model.
TensorFlow saves variables in binary checkpoint files that, roughly speaking, map variable names to tensor values.
Saving variables
Create a Saver with tf.train.Saver() to manage all variables in the model. For example, the following snippet demonstrates how to call the tf.train.Saver.save method to save variables to a checkpoint file:

# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer=tf.zeros_initializer)

inc_v1 = v1.assign(v1 + 1)
dec_v2 = v2.assign(v2 - 1)

# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")
    print("Model saved in file: %s" % save_path)
You number checkpoint filenames by passing a value to the optional global_step argument to save():

saver.save(sess, 'my-model', global_step=0)     ==> filename: 'my-model-0'
...
saver.save(sess, 'my-model', global_step=1000)  ==> filename: 'my-model-1000'
Additionally, optional arguments to the Saver() constructor let you control the proliferation of checkpoint files on disk:
- max_to_keep: indicates the maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept).
- keep_checkpoint_every_n_hours: in addition to keeping the most recent max_to_keep checkpoint files, you might want to keep one checkpoint file for every N hours of training. This can be useful if you want to later analyze how a model progressed during a long training session. For example, passing keep_checkpoint_every_n_hours=2 ensures that you keep one checkpoint file for every 2 hours of training. The default value of 10,000 hours effectively disables the feature.
Note that you still have to call the save() method to save the model. Passing these arguments to the constructor will not save variables automatically for you.
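For example (the argument values here are arbitrary illustrations):

# Keep at most 3 recent checkpoints, plus one checkpoint for every 2 hours of training.
saver = tf.train.Saver(max_to_keep=3, keep_checkpoint_every_n_hours=2)
# save() must still be called explicitly, e.g. periodically inside the training loop:
# saver.save(sess, '/tmp/model.ckpt', global_step=global_step)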
Restoring variables
The tf.train.Saver object not only saves variables to checkpoint files, it also restores variables. Note that when you restore variables from a file you do not have to initialize them beforehand. For example, the following snippet demonstrates how to call the tf.train.Saver.restore method to restore variables from a checkpoint file:

tf.reset_default_graph()

# Create some variables.
v1 = tf.get_variable("v1", shape=[3])
v2 = tf.get_variable("v2", shape=[5])

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")
    print("Model restored.")
    # Check the values of the variables
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
Choosing which variables to save and restore
If you do not pass any arguments to tf.train.Saver(), the saver handles all variables in the graph. Each variable is saved under the name that was passed when the variable was created.
It is sometimes useful to explicitly specify names for variables in the checkpoint files. For example, you may have trained a model with a variable named “weights” whose value you want to restore into a variable named “params”.
It is also sometimes useful to only save or restore a subset of the variables used by a model. For example, you may have trained a neural net with five layers, and you now want to train a new model with six layers that reuses the existing weights of the five trained layers. You can use the saver to restore the weights of just the first five layers.
You can easily specify the names and variables to save or load by passing to the tf.train.Saver() constructor either of the following:
- A list of variables (which will be stored under their own names).
- A Python dictionary in which keys are the names to use and the values are the variables to manage.
tf.reset_default_graph()

# Create some variables.
v1 = tf.get_variable("v1", [3], initializer=tf.zeros_initializer)
v2 = tf.get_variable("v2", [5], initializer=tf.zeros_initializer)

# Add ops to save and restore only `v2` using the name "v2"
saver = tf.train.Saver({"v2": v2})

# Use the saver object normally after that.
with tf.Session() as sess:
    # Initialize v1 since the saver will not.
    v1.initializer.run()
    saver.restore(sess, "/tmp/model.ckpt")
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
Notes:
- You can create as many Saver objects as you want if you need to save and restore different subsets of the model variables. The same variable can be listed in multiple saver objects; its value is only changed when the Saver.restore() method is run.
- If you only restore a subset of the model variables at the start of a session, you have to run an initialize op for the other variables. See tf.variables_initializer for more information.
- To inspect the variables in a checkpoint, you can use the inspect_checkpoint library, particularly the print_tensors_in_checkpoint_file function.
- By default, Saver uses the value of the tf.Variable.name property for each variable. However, when you create a Saver object, you may optionally choose names for the variables in the checkpoint files.
_More details in https://www.tensorflow.org/programmers_guide/saved_model_
Why should we still learn gradients?
You’ve probably noticed that in all the models we’ve built so far, we haven’t taken a single
gradient. All we need to do is to build a forward pass and TensorFlow takes care of the
backward path for us. So, the question is: why should we still learn to take gradients? Why are Chris Manning and Richard Socher making us take gradients of cross entropy and softmax? Shouldn’t taking gradients by hand one day be as obsolete as taking square roots by hand has been since the invention of the calculator?
Well, maybe. But for now, TensorFlow can take gradients for us, but it can’t give us intuition
about what functions to use. It doesn’t tell us if a function will suffer from exploding or vanishing
gradients. We still need to know about gradients to get an understanding of why a model works
while another doesn’t.
By default, the optimizer trains all the trainable variables its objective function depends on. If there are variables that you do not want to train, you can set the keyword trainable=False when you declare them. One example of a variable you don’t want to train is global_step, a common variable you will see in many TensorFlow models to keep track of how many times you’ve run your model:

global_step = tf.Variable(0, trainable=False, dtype=tf.int32)
learning_rate = 0.01 * 0.99 ** tf.cast(global_step, tf.float32)
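To make global_step actually advance, pass it to minimize(); the optimizer then increments it by one each time it runs:

# global_step is incremented automatically on every training step.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)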
You can also ask your optimizer to take gradients of specific variables. You can also modify the gradients calculated by your optimizer:

# create an optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# compute the gradients for a list of variables.
grads_and_vars = optimizer.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example, subtract each of them by 1.
subtracted_grads_and_vars = [(gv[0] - 1.0, gv[1]) for gv in grads_and_vars]

# ask the optimizer to apply the subtracted gradients.
optimizer.apply_gradients(subtracted_grads_and_vars)
You can also prevent certain tensors from contributing to the calculation of the derivatives with respect to a specific loss with tf.stop_gradient:

stop_gradient(input, name=None)
This is very useful in situations when you want to freeze certain variables during training. Here are some examples given by TensorFlow’s official documentation.
- When you train a GAN (Generative Adversarial Network) where no backprop should happen through the adversarial example generation process.
- The EM algorithm where the M-step should not involve backpropagation through the output of the E-step.
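A minimal sketch of the mechanics (the tiny two-layer network here is purely illustrative): the gradient flows to w2 but not through the stopped tensor, so w1 receives no gradient from this loss.

# Sketch: freeze the contribution of `hidden` during backpropagation.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 10])
w1 = tf.Variable(tf.random_normal([10, 20]))
w2 = tf.Variable(tf.random_normal([20, 1]))

hidden = tf.nn.relu(tf.matmul(x, w1))
frozen = tf.stop_gradient(hidden)          # treated as a constant during backprop
output = tf.matmul(frozen, w2)
loss = tf.reduce_mean(tf.square(output))

grads = tf.gradients(loss, [w1, w2])       # gradient w.r.t. w1 is None, w2 gets a gradient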
The optimizer classes automatically compute derivatives on your graph, but you can explicitly ask TensorFlow to calculate certain gradients with tf.gradients:

tf.gradients(
    ys,
    xs,
    grad_ys=None,
    name='gradients',
    colocate_gradients_with_ops=False,
    gate_gradients=False,
    aggregation_method=None,
    stop_gradients=None
)
This method constructs symbolic partial derivatives of the sum of ys with respect to each x in xs. ys and xs are each a Tensor or a list of tensors. grad_ys is a list of Tensors holding the gradients received by the ys; the list must be the same length as ys.
Technical detail: This is especially useful when training only parts of a model. For example, we can use tf.gradients() to take the derivative G of the loss w.r.t. the middle layer. Then we use an optimizer to minimize the difference between the middle layer output M and M + G. This updates only the lower half of the network.
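A minimal sketch of that idea (the two-layer network and variable names are illustrative assumptions):

# Sketch: update only the lower half of a network using tf.gradients.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 10])
w1 = tf.Variable(tf.random_normal([10, 20]))   # lower half (to be trained)
w2 = tf.Variable(tf.random_normal([20, 1]))    # upper half (left untouched)

middle = tf.nn.relu(tf.matmul(x, w1))          # M: the middle layer output
output = tf.matmul(middle, w2)
loss = tf.reduce_mean(tf.square(output))

grad_middle = tf.gradients(loss, middle)[0]    # G: derivative of the loss w.r.t. the middle layer

# Move M toward M + G; stop_gradient keeps the target fixed during backprop.
target = tf.stop_gradient(middle + grad_middle)
lower_loss = tf.reduce_mean(tf.square(middle - target))
train_lower = tf.train.GradientDescentOptimizer(0.1).minimize(lower_loss, var_list=[w1])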