Sticking to the sequential model limits the freedom you have when creating models in TensorFlow. In this post, I’ll start by showing a simple sequential model. I’ll then convert it into a functional model. Finally, I’ll use GradientTape to calculate all gradients and control the full training loop, opening up more possibilities for building your models.
I’ll be using a simple feed-forward network to illustrate the examples. Let’s start with the sequential model. To keep the number of weights small, I’ll add two layers with one neuron each. The model will be trained on randomly generated data to predict the same number it receives as input. If you’re unfamiliar with shapes, I have a few posts on this topic: intro and a closer look.
Sequential Model
import tensorflow as tf

# define the sequential model
# it contains two dense layers with one neuron each
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
    tf.keras.layers.Dense(1)
])

# compile the model in order to define the loss function
model.compile(loss="mse")

# define the data as 100000 random numbers
data = tf.random.uniform((100000,1))

# train the model to predict the same number
model.fit(x=data, y=data, epochs=2)

# print the weights
for l in model.weights:
    print(f"{l.name}: {l.numpy()}")

print(model.weights[0].name)
Train on 100000 samples
Epoch 1/2
100000/100000 [==============================] - 2s 17us/sample - loss: 0.0365
Epoch 2/2
100000/100000 [==============================] - 2s 15us/sample - loss: 3.4097e-06
dense/kernel:0: [[-0.6452949]]
dense/bias:0: [0.00095751]
dense_1/kernel:0: [[-1.5518339]]
dense_1/bias:0: [0.00247312]
dense/kernel:0
We can see that the input weights for the two neurons are -0.65 and -1.55, respectively, and the biases are 0.0009 and 0.002. As humans, we can immediately tell how to build the perfect model for the job: set both neuron weights to 1 and both biases to 0. Our model did the right thing in keeping the biases small. The weights it picked are also valid, since -0.6452949 * -1.5518339 = 1.0013, which is close enough to 1.
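To make that "perfect model" idea concrete, here is a minimal sketch (not part of the original code) that hand-sets both kernels to 1 and both biases to 0, then checks that the network reproduces its input:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
    tf.keras.layers.Dense(1)
])

# hand-set the ideal weights: kernel 1 and bias 0 for both layers
model.set_weights([
    np.array([[1.0]]), np.array([0.0]),
    np.array([[1.0]]), np.array([0.0]),
])

print(model.predict(np.array([[0.5]])))  # prints [[0.5]], the input unchanged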
Notice how much TensorFlow manages for us. We haven’t set a batch size, so it defaults to 32. The fit function looks up the relevant loss function based on the “mse” string we gave it, obtains a prediction from the input, calculates the loss, and backpropagates the error.
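For reference, here is roughly what those defaults expand to when written out explicitly. This is a sketch equivalent to the sequential example above, not new behaviour:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
    tf.keras.layers.Dense(1)
])

# what the "mse" string resolves to
model.compile(loss=tf.keras.losses.MeanSquaredError())

data = tf.random.uniform((100000, 1))

# the implicit default batch size spelled out
model.fit(x=data, y=data, epochs=2, batch_size=32)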
Functional Model
The functional model gives us more flexibility in accessing the layers. Among other benefits, it allows us to define multiple inputs and outputs. We can try to force both neurons to learn weights close to 1 by constraining the output of the first layer to also be close to the ground truth.
import tensorflow as tf

a = tf.keras.layers.Input(shape=(1,))
dense_1 = tf.keras.layers.Dense(1)(a)
dense_2 = tf.keras.layers.Dense(1)(dense_1)

model = tf.keras.Model(inputs=a, outputs=[dense_1, dense_2])
model.compile(loss="mse")

data = tf.random.uniform((100000,1))
model.fit(x=data, y=[data, data], epochs=2)

for l in model.weights:
    print(f"{l.name}: {l.numpy()}")

print(model.weights[0].name)
Train on 100000 samples
Epoch 1/2
100000/100000 [==============================] - 2s 20us/sample - loss: 0.5872 - dense_loss: 0.3384 - dense_1_loss: 0.2488
Epoch 2/2
100000/100000 [==============================] - 2s 17us/sample - loss: 3.6310e-05 - dense_loss: 1.6687e-05 - dense_1_loss: 1.9623e-05
dense/kernel:0: [[0.9995292]]
dense/bias:0: [-0.00030103]
dense_1/kernel:0: [[0.9994892]]
dense_1/bias:0: [-0.00027911]
dense/kernel:0
While fitting, we get information about both losses, since we now have two of them. As expected, the weights look much more like what we wanted. The constraint on the first dense layer forced it to train towards a weight of 1, and the second layer adapted accordingly. I hope this illustrates the power of having more control over what we feed into the model and what we get back from it.
To see why multiple inputs matter, imagine a model that takes a picture and a sentence as inputs and outputs a small story combining the two. With multiple inputs, we can use CNN layers for the image while embedding the text separately and passing it through a few RNN layers. The outputs of the two branches can then be combined for a common result.
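A minimal sketch of such an architecture might look like the following; the input shapes and layer sizes are made up for illustration and are not tuned for any real task:

import tensorflow as tf

# hypothetical inputs: 64x64 RGB images and sentences of up to 20 token ids
image_in = tf.keras.layers.Input(shape=(64, 64, 3))
text_in = tf.keras.layers.Input(shape=(20,))

# CNN branch for the image
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(image_in)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# embedding + RNN branch for the sentence
y = tf.keras.layers.Embedding(input_dim=10000, output_dim=32)(text_in)
y = tf.keras.layers.LSTM(32)(y)

# combine both branches into a common representation
combined = tf.keras.layers.concatenate([x, y])
output = tf.keras.layers.Dense(64, activation="relu")(combined)

model = tf.keras.Model(inputs=[image_in, text_in], outputs=output)
model.summary()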
Intro to GradientTape
The functional model offers a good amount of flexibility, enough for most use cases. I still think it’s useful to know how to manually create and train a model. Among other benefits, it can allow for better debugging and organisation of your code. I’ll be writing a post about connectionist temporal classification soon, in which I expect to use this method.
We use GradientTape to create a context in which all operations on tensors are recorded; everything performed on the input inside the context gets registered on the tape. In practice, this means we can ask the tape for the partial derivatives of the loss with respect to the trainable variables, and then adjust the weights based on those partial derivatives. It’s much easier than it sounds, so let’s see it in action.
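As a tiny illustration before the full training loop (a sketch I’m adding here, not part of the original example), the tape records a computation and then hands back its gradient:

import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as g:
    y = x * x  # recorded on the tape

# dy/dx = 2x, so this prints 6.0
print(g.gradient(y, x).numpy())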
import tensorflow as tf

dense_1 = tf.keras.layers.Dense(1)
dense_2 = tf.keras.layers.Dense(1)

# our data
data = tf.random.uniform((32*1000,1))

# define epochs, batch size
EPOCHS = 2
BATCH_SIZE = 32
LR = 0.01

# we need to handle batches manually
data = tf.reshape(data, (-1, BATCH_SIZE, 1))

# iterate through epochs
for e in range(EPOCHS):
    print(f"Epoch {e+1}")
    # every batch individually
    for batch_index, batch in enumerate(data):
        with tf.GradientTape() as g:
            # use the model to perform the prediction
            inner = dense_1(batch)
            output = dense_2(inner)
            # compute the loss
            losses = tf.keras.losses.MSE(output, batch)
            loss = tf.reduce_mean(losses)
        # retrieve the gradients of the trainable variables wrt the loss
        gradients = g.gradient(loss, [dense_1.variables, dense_2.variables])
        # print what's happening
        print(f"Batch {batch_index}/{len(data)}, Loss {loss.numpy()}"+" "*10, end="\r")
        for layer, lgrad in zip([dense_1.variables, dense_2.variables], gradients):
            for var, grad in zip(layer, lgrad):
                # apply the gradient
                var.assign_add(grad*-LR)
    print()
Epoch 1
Batch 999/1000, Loss 0.0004924628883600235
Epoch 2
Batch 999/1000, Loss 1.6632682218187256e-06
For this example, I have deliberately made life difficult for myself by defining and individually managing layers. This meant more work getting their gradients and applying them. In practice, one would normally create a model and use an optimiser to apply the gradients. Here’s the above code reworked:
import tensorflow as tf

# define our model, no input needed
class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.dense_1 = tf.keras.layers.Dense(1)
        self.dense_2 = tf.keras.layers.Dense(1)

    def call(self, x):
        inner = self.dense_1(x)
        output = self.dense_2(inner)
        return output

# initialise our model
model = MyModel()

# our data
data = tf.random.uniform((32*1000,1))

# define epochs, batch size
EPOCHS = 2
BATCH_SIZE = 32
LR = 0.001

# we need to handle batches manually
data = tf.reshape(data, (-1, BATCH_SIZE, 1))

# easiest way to apply gradients is using an optimiser
adam = tf.keras.optimizers.Adam(learning_rate=LR)

# iterate through epochs
for e in range(EPOCHS):
    print(f"Epoch {e+1}")
    # every batch individually
    for batch_index, batch in enumerate(data):
        with tf.GradientTape() as g:
            # use the model to perform the prediction
            output = model(batch)
            # compute the loss
            losses = tf.keras.losses.MSE(output, batch)
            loss = tf.reduce_mean(losses)
        # retrieve the gradients of the trainable variables wrt the loss
        gradients = g.gradient(loss, model.trainable_variables)
        # print what's happening
        print(f"Batch {batch_index}/{len(data)}, Loss {loss.numpy()}"+" "*10, end="\r")
        # use adam to apply the gradients to the trainable variables
        adam.apply_gradients(zip(gradients, model.trainable_variables))
    print()
Epoch 1
Batch 999/1000, Loss 6.90423917149019e-07
Epoch 2
Batch 999/1000, Loss 4.2168525615782215e-13
Doing things this way offers the most flexibility, but it is also much more prone to errors. Exploring it in depth would be a post by itself, which I hope to write soon. Still, I hope it was useful to see the many different ways we can approach tasks in TensorFlow.
Thank you for reading! I hope this post gave you some insight into the different approaches to creating models in TensorFlow. Different tasks call for different levels of complexity, and it’s important to know which approach to pick for the job.