Sticking to the sequential model limits the freedom you have when creating models in TensorFlow. In this post, I’ll start by showing a simple sequential model. I’ll then convert it into a functional model. Finally, I’ll use GradientTape to calculate all gradients and control the full training loop, opening up more possibilities for building your models.
I’ll be using a simple feed-forward network to illustrate the examples. Let’s start with the sequential model. To keep the number of weights small, I’ll add two layers with one neuron each. The model will be trained on randomly generated data to predict the same number it receives as input. If you’re unfamiliar with shapes, I have a few posts on this topic: intro and a closer look.
Sequential Model
import tensorflow as tf

# define the sequential model
# it contains two dense layers with one neuron each
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
    tf.keras.layers.Dense(1)
])

# compile the model in order to define the loss function
model.compile(loss="mse")

# define the data as 100000 random numbers
data = tf.random.uniform((100000,1))

# train the model to predict the same number
model.fit(x=data, y=data, epochs=2)

# print the weights
for l in model.weights:
    print(f"{l.name}: {l.numpy()}")

print(model.weights[0].name)
Train on 100000 samples
Epoch 1/2
100000/100000 [==============================] - 2s 17us/sample - loss: 0.0365
Epoch 2/2
100000/100000 [==============================] - 2s 15us/sample - loss: 3.4097e-06
dense/kernel:0: [[-0.6452949]]
dense/bias:0: [0.00095751]
dense_1/kernel:0: [[-1.5518339]]
dense_1/bias:0: [0.00247312]
dense/kernel:0
We can see that the input weights for the two neurons are -0.65 and -1.55, respectively, and the biases are 0.0009 and 0.002. As humans, we can immediately tell how to build the perfect model for the job: set both neuron weights to 1 and both biases to 0. Our model did the right thing in keeping the biases small. The weights it picked are also valid, since -0.6452949 * -1.5518339 = 1.0013, which is close enough to 1.
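To make that "perfect model" idea concrete, here is a minimal sketch (not part of the original code) that hand-sets both kernels to 1 and both biases to 0, then checks that the network reproduces its input:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
    tf.keras.layers.Dense(1)
])

# hand-set the ideal weights: kernel 1 and bias 0 for both layers
model.set_weights([
    np.array([[1.0]]), np.array([0.0]),
    np.array([[1.0]]), np.array([0.0]),
])

print(model.predict(np.array([[0.5]])))  # prints [[0.5]], the input unchanged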
Notice how much TensorFlow manages for us. We haven’t set a batch size, so it defaults to 32. The fit function looks up the relevant loss function based on the “mse” string we gave it, obtains a prediction from the input, calculates the loss, and backpropagates the error.
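For reference, here is roughly what those defaults expand to when written out explicitly. This is a sketch equivalent to the sequential example above, not new behaviour:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
    tf.keras.layers.Dense(1)
])

# what the "mse" string resolves to
model.compile(loss=tf.keras.losses.MeanSquaredError())

data = tf.random.uniform((100000, 1))

# the implicit default batch size spelled out
model.fit(x=data, y=data, epochs=2, batch_size=32)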
Functional Model
The functional model gives us more flexibility in accessing the layers. Among other benefits, it allows us to define multiple inputs and outputs. We can try to force both neurons to learn weights close to 1 by constraining the output of the first layer to also be close to the ground truth.
import tensorflow as tf

a = tf.keras.layers.Input(shape=(1,))
dense_1 = tf.keras.layers.Dense(1)(a)
dense_2 = tf.keras.layers.Dense(1)(dense_1)

model = tf.keras.Model(inputs=a, outputs=[dense_1, dense_2])
model.compile(loss="mse")

data = tf.random.uniform((100000,1))
model.fit(x=data, y=[data, data], epochs=2)

for l in model.weights:
    print(f"{l.name}: {l.numpy()}")

print(model.weights[0].name)
Train on 100000 samples
Epoch 1/2
100000/100000 [==============================] - 2s 20us/sample - loss: 0.5872 - dense_loss: 0.3384 - dense_1_loss: 0.2488
Epoch 2/2
100000/100000 [==============================] - 2s 17us/sample - loss: 3.6310e-05 - dense_loss: 1.6687e-05 - dense_1_loss: 1.9623e-05
dense/kernel:0: [[0.9995292]]
dense/bias:0: [-0.00030103]
dense_1/kernel:0: [[0.9994892]]
dense_1/bias:0: [-0.00027911]
dense/kernel:0
While fitting, we get information about both losses, since we now have two of them. As expected, the weights look much more like what we wanted. The constraint on the first dense layer forced it to train towards a weight of 1, and the second layer adapted accordingly. I hope this illustrates the power of having more control over what we feed into the model and what we get back from it.
To see why multiple inputs matter, imagine a model that takes a picture and a sentence as inputs and outputs a small story combining the two. With multiple inputs, we can use CNN layers for the image while embedding the text separately and passing it through a few RNN layers. The outputs of the two branches can then be combined for a common result.
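A minimal sketch of such an architecture might look like the following; the input shapes and layer sizes are made up for illustration and are not tuned for any real task:

import tensorflow as tf

# hypothetical inputs: 64x64 RGB images and sentences of up to 20 token ids
image_in = tf.keras.layers.Input(shape=(64, 64, 3))
text_in = tf.keras.layers.Input(shape=(20,))

# CNN branch for the image
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(image_in)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# embedding + RNN branch for the sentence
y = tf.keras.layers.Embedding(input_dim=10000, output_dim=32)(text_in)
y = tf.keras.layers.LSTM(32)(y)

# combine both branches into a common representation
combined = tf.keras.layers.concatenate([x, y])
output = tf.keras.layers.Dense(64, activation="relu")(combined)

model = tf.keras.Model(inputs=[image_in, text_in], outputs=output)
model.summary()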
Intro to GradientTape
The functional model offers a good amount of flexibility, enough for most use cases. I still think it’s useful to know how to manually create and train a model. Among other benefits, it can allow for better debugging and organisation of your code. I’ll be writing a post about connectionist temporal classification soon, in which I expect to use this method.
We use GradientTape to create a context in which all operations on tensors are recorded; everything performed on the input inside the context gets registered on the tape. In practice, this means we can ask the tape for the partial derivatives of the loss with respect to the trainable variables, and then adjust the weights based on those partial derivatives. It’s much easier than it sounds, so let’s see it in action.
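As a tiny illustration before the full training loop (a sketch I’m adding here, not part of the original example), the tape records a computation and then hands back its gradient:

import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as g:
    y = x * x  # recorded on the tape

# dy/dx = 2x, so this prints 6.0
print(g.gradient(y, x).numpy())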
import tensorflow as tf

dense_1 = tf.keras.layers.Dense(1)
dense_2 = tf.keras.layers.Dense(1)

# our data
data = tf.random.uniform((32*1000,1))

# define epochs, batch size
EPOCHS = 2
BATCH_SIZE = 32
LR = 0.01

# we need to handle batches manually
data = tf.reshape(data, (-1, BATCH_SIZE, 1))

# iterate through epochs
for e in range(EPOCHS):
    print(f"Epoch {e+1}")
    # every batch individually
    for batch_index, batch in enumerate(data):
        with tf.GradientTape() as g:
            # use the model to perform the prediction
            inner = dense_1(batch)
            output = dense_2(inner)
            # compute the loss
            losses = tf.keras.losses.MSE(output, batch)
            loss = tf.reduce_mean(losses)
        # retrieve the gradients of the trainable variables wrt the loss
        gradients = g.gradient(loss, [dense_1.variables, dense_2.variables])
        # print what's happening
        print(f"Batch {batch_index}/{len(data)}, Loss {loss.numpy()}"+" "*10, end="\r")
        for layer, lgrad in zip([dense_1.variables, dense_2.variables], gradients):
            for var, grad in zip(layer, lgrad):
                # apply the gradient
                var.assign_add(grad*-LR)
    print()
Epoch 1
Batch 999/1000, Loss 0.0004924628883600235
Epoch 2
Batch 999/1000, Loss 1.6632682218187256e-06
For this example, I have deliberately made life difficult for myself by defining and individually managing layers. This meant more work getting their gradients and applying them. In practice, one would normally create a model and use an optimiser to apply the gradients. Here’s the above code reworked:
import tensorflow as tf

# define our model, no input needed
class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.dense_1 = tf.keras.layers.Dense(1)
        self.dense_2 = tf.keras.layers.Dense(1)

    def call(self, x):
        inner = self.dense_1(x)
        output = self.dense_2(inner)
        return output

# initialise our model
model = MyModel()

# our data
data = tf.random.uniform((32*1000,1))

# define epochs, batch size
EPOCHS = 2
BATCH_SIZE = 32
LR = 0.001

# we need to handle batches manually
data = tf.reshape(data, (-1, BATCH_SIZE, 1))

# easiest way to apply gradients is using an optimiser
adam = tf.keras.optimizers.Adam(learning_rate=LR)

# iterate through epochs
for e in range(EPOCHS):
    print(f"Epoch {e+1}")
    # every batch individually
    for batch_index, batch in enumerate(data):
        with tf.GradientTape() as g:
            # use the model to perform the prediction
            output = model(batch)
            # compute the loss
            losses = tf.keras.losses.MSE(output, batch)
            loss = tf.reduce_mean(losses)
        # retrieve the gradients of the trainable variables wrt the loss
        gradients = g.gradient(loss, model.trainable_variables)
        # print what's happening
        print(f"Batch {batch_index}/{len(data)}, Loss {loss.numpy()}"+" "*10, end="\r")
        # use adam to apply the gradients to the trainable variables
        adam.apply_gradients(zip(gradients, model.trainable_variables))
    print()
Epoch 1
Batch 999/1000, Loss 6.90423917149019e-07
Epoch 2
Batch 999/1000, Loss 4.2168525615782215e-13
Doing things this way offers the most flexibility, but it is also much more prone to errors. Exploring it in depth would be a post by itself, which I hope to write soon. Still, I hope it was useful to see the many different ways we can approach tasks in TensorFlow.
Thank you for reading! I hope this post gave you some insight into the different approaches to creating models in TensorFlow. Different tasks call for different levels of complexity, and it’s important to know which approach to pick for the job.