This is the first part of a multi-part tutorial that will be published periodically. It will guide you through the most basic concepts essential to the task of image recognition. By the end of the series, we will have developed an application that creates a neural network in TensorFlow, which can be trained to recognize our own database of images.

To achieve that, we will start our journey with a very simple example that introduces the basic aspects of TensorFlow, and build on that knowledge until we reach the proposed objective.

This article draws on a compilation of different sources (manuals and blogs), as well as on knowledge acquired from my experience developing my own applications for various tasks in different areas. In the bibliography, we will refer to the various sources used.

Let’s dive in!

# Introduction

**What is TensorFlow?**

TensorFlow is an open source library for numerical computation that uses data flow graphs as its programming model. The nodes in the graph represent mathematical operations, while the connections or links between them represent the multidimensional data arrays (tensors).

With this library we can, among other operations, build and train neural networks to detect correlations and decipher patterns. TensorFlow is currently used both in research and in production for Google products.

TensorFlow is Google Brain’s second-generation machine learning system, released as open source software on November 9, 2015. While the reference implementation runs on single devices, TensorFlow can run on multiple CPUs and GPUs (with optional CUDA extensions for general-purpose computing on graphics processing units). TensorFlow is available on 64-bit Linux, macOS, and mobile platforms including Android and iOS.

TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow derives from the operations that neural networks perform on multidimensional arrays of data. These multidimensional arrays are referred to as “tensors” (for more details, see https://www.tensorflow.org/).

## Handwriting recognition

In this example, we’ll learn how to recognize some handwritten characters.

We will start by implementing the simplest possible model. In this case, we will use linear regression as the first model for recognizing the characters, which are treated as images.

First, we will load a set of images of handwritten digits from the MNIST dataset, and then define and optimize a linear regression model in TensorFlow.

Note: Some basic knowledge about Python and some basic understanding of Machine Learning will help you here.

First, we will load some libraries.

```python
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix
```

**Download the Data**

The MNIST data set is approximately 12 MB and will be automatically downloaded if it is not in the given path.

```python
# Load Data
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets("data/MNIST/", one_hot=True)
```

Let’s verify the data.

```python
print("Size of:")
print("- Training-set:\t\t{}".format(len(data.train.labels)))
print("- Test-set:\t\t{}".format(len(data.test.labels)))
print("- Validation-set:\t{}".format(len(data.validation.labels)))
```

This should output:

```
Size of:
- Training-set:		55000
- Test-set:		10000
- Validation-set:	5000
```

As we can see, we now have three subsets of data: one for training, one for testing, and one for validation.

## One-Hot Encoding

The dataset has been loaded with One-Hot encoding enabled. This means that the labels have been converted from a single number into a vector whose length is equal to the number of possible classes.
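As a quick illustration of what that conversion means, here is a minimal plain-NumPy sketch (the labels below are made up for the example, not taken from the dataset):

```python
import numpy as np

# Hypothetical class labels (digits) and the number of possible classes.
labels = np.array([7, 2, 1])
num_classes = 10

# A zero matrix with a 1.0 placed at each label's index.
one_hot = np.zeros((len(labels), num_classes))
one_hot[np.arange(len(labels)), labels] = 1.0

print(one_hot)
```

Each row is all zeros except for a single 1 at the position of the class.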

For example, the One-Hot encoded labels for the first 5 images in the test set are:

```python
data.test.labels[0:5, :]
```

We get:

```
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.]])
```

As we can see, we have five vectors where every component is zero except the one at the position that identifies the class, whose value is 1.

Since we need the classes as single numbers for comparisons and performance measures, we convert these One-Hot encoded vectors to a single number by taking the index of the largest element. Note that the word ‘class’ is a reserved keyword in Python, so we use the name ‘cls’ instead.

To encode these vectors to numbers:

```python
data.test.cls = np.array([label.argmax() for label in data.test.labels])
```

Now we can see the class for the first five images in the test set.

```python
print(data.test.cls[0:5])
```

We get:

```
[7 2 1 0 4]
```

Let’s compare these with the One-Hot encoded vectors above. For example, the class for the first image is 7, which corresponds to a One-Hot encoded vector where all elements are zero except the element with index 7.

The next step is to define some variables that will be used in the code. These variables and their constant values allow us to write cleaner, more readable code.

We define them as follows:

```python
# We know that MNIST images are 28 pixels in each dimension.
img_size = 28

# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size

# Tuple with height and width of images used to reshape arrays.
img_shape = (img_size, img_size)

# Number of classes, one class for each of 10 digits.
num_classes = 10
```

Now we will create a function that draws 9 images in a 3×3 grid and writes the true and predicted classes under each image.

```python
def plot_images(images, cls_true, cls_pred=None):
    assert len(images) == len(cls_true) == 9

    # Create figure with 3x3 sub-plots.
    fig, axes = plt.subplots(3, 3)
    fig.subplots_adjust(hspace=0.5, wspace=0.5)

    for i, ax in enumerate(axes.flat):
        # Plot image.
        ax.imshow(images[i].reshape(img_shape), cmap='binary')

        # Show true and predicted classes.
        if cls_pred is None:
            xlabel = "True: {0}".format(cls_true[i])
        else:
            xlabel = "True: {0}, Pred: {1}".format(cls_true[i], cls_pred[i])

        ax.set_xlabel(xlabel)

        # Remove ticks from the plot.
        ax.set_xticks([])
        ax.set_yticks([])
```

Let’s draw some images to see if the data is correct:

```python
# Get the first images from the test-set.
images = data.test.images[0:9]

# Get the true classes for those images.
cls_true = data.test.cls[0:9]

# Plot the images and labels using our helper-function above.
plot_images(images=images, cls_true=cls_true)
```

The purpose of using the TensorFlow library is to generate a computational graph that can be executed much more efficiently. TensorFlow can be more efficient than NumPy (in some cases) because TensorFlow knows the entire computation graph and its data flow, while NumPy only knows about the single mathematical operation it is executing at any given moment.

TensorFlow can also automatically calculate the gradients needed to optimize the variables of the graph so that the model performs better. This is possible because the graph is a combination of simple mathematical expressions, so the gradient of the entire graph can be calculated using the chain rule for derivatives when optimizing the cost function.
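TensorFlow derives these gradients symbolically from the graph. To see the chain rule at work on a tiny hand-made example (plain Python, with made-up numbers; this is a sketch of the idea, not TensorFlow's mechanism):

```python
# Tiny composite "model": f(w) = (w*x - y)^2, with made-up data x, y.
x, y = 3.0, 6.0

def f(w):
    return (w * x - y) ** 2

def grad(w):
    # Chain rule: outer derivative 2*(w*x - y) times inner derivative x.
    return 2.0 * (w * x - y) * x

# Numerical check with a central finite difference.
w, eps = 1.5, 1e-6
numeric = (f(w + eps) - f(w - eps)) / (2 * eps)

print(grad(w), numeric)  # both close to -9.0
```

The analytic gradient from the chain rule matches the finite-difference estimate, which is exactly the kind of bookkeeping TensorFlow automates for every variable in the graph.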

A graph of TensorFlow in general consists of the following parts:

- The placeholder variables, used to feed the inputs (data) into the graph (the links between the nodes).
- The variables of the model.
- The model, which is essentially a mathematical function that calculates the output given the placeholder variables and the model variables (remember that, from TensorFlow’s point of view, mathematical operations are treated as nodes of the graph).
- A cost measure that can be used to guide the optimization of the variables.
- An optimization method that updates the model variables.

In addition, the TensorFlow graph can also contain debug statements, for example so that log data can be displayed using TensorBoard, which we won’t be covering in this tutorial.

Placeholder variables serve as inputs to the graph, and we can change them each time we execute operations on the graph.

Let’s define the placeholder variable for the input images, which we will call x. This allows us to change the images that are fed into the TensorFlow graph.

The data fed into the graph are multidimensional vectors or matrices (called tensors). These tensors are multidimensional arrays whose shape here is [None, img_size_flat], where None means that the tensor can hold an arbitrary number of images, each image being a vector of length img_size_flat.

We define:

```python
x = tf.placeholder(tf.float32, [None, img_size_flat])
```

Note that in the function tf.placeholder, we have to define the data type, which in this case is a float32.

Next, we define the placeholder variable for the true labels associated with the images that are fed into the placeholder variable x. The shape of this placeholder variable is [None, num_classes], which means it can hold an arbitrary number of labels, and each label is a vector of length num_classes, which is 10 in our case.

```python
y_true = tf.placeholder(tf.float32, [None, num_classes])
```

Finally, we have the placeholder variable for the true class of each image in the placeholder variable x. These are integers and the dimensionality of this placeholder variable is pre-defined as [None], which means that the placeholder variable is a one-dimensional vector of arbitrary length.

```python
y_true_cls = tf.placeholder(tf.int64, [None])
```

## Model

As we indicated previously, in this example we are going to use a simple linear regression model. That is, we will define a linear function where the images in the placeholder variable \(x\) are multiplied by a variable \(w\), which we will call the weights, and then a bias \(b\) is added.

So:

\( logits = xw + b \)

The result is a matrix of shape [num_images, num_classes], given that \(x\) has shape [num_images, img_size_flat] and the weight matrix \(w\) has shape [img_size_flat, num_classes], so the product of these two matrices is a matrix of shape [num_images, num_classes]. The bias vector \(b\) is then added to each row of the resulting matrix.
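This shape bookkeeping can be verified with a quick plain-NumPy sketch (toy arrays of zeros, with an arbitrary num_images, just to check the shapes):

```python
import numpy as np

# Toy dimensions standing in for the real ones (num_images is arbitrary).
num_images, img_size_flat, num_classes = 5, 784, 10

x_np = np.zeros((num_images, img_size_flat))   # input images
w_np = np.zeros((img_size_flat, num_classes))  # weights
b_np = np.zeros(num_classes)                   # biases

# Matrix product plus a broadcast addition of the bias row vector.
logits_np = x_np.dot(w_np) + b_np
print(logits_np.shape)  # (5, 10)
```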

Note that we have used the name \(logits\) to respect typical TensorFlow terminology, but you are free to name the variable differently.

We define this operation in TensorFlow in the following way:

- First, we declare the variable \(w\) (the weights tensor) and initialize it with zeros:

```python
w = tf.Variable(tf.zeros([img_size_flat, num_classes]))
```

Then we declare the variable \(b\) (the bias tensor) and initialize it with zeros:

```python
b = tf.Variable(tf.zeros([num_classes]))
```

We define our linear model:

```python
logits = tf.matmul(x, w) + b
```

In the definition of the \(logits\) model, we have used the tf.matmul function, which returns the matrix product of the tensor \(x\) and the tensor \(w\).

The logits model is a matrix with **num_images** rows and **num_classes** columns, where the element in row \(i\) and column \(j\) is an estimate of the probability that the \(i\)-th input image belongs to the \(j\)-th class.

However, these estimates are a bit difficult to interpret, since the numbers obtained can be very small or very large. The next step is to normalize the values so that each row of the \(logits\) matrix sums to one, with each element restricted to values between zero and one. In TensorFlow, this is computed with the function called **softmax**, and the result is stored in a new variable, **y_pred**.

```python
y_pred = tf.nn.softmax(logits)
```
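To make the normalization concrete, here is a minimal softmax sketch in plain NumPy (with made-up logits; tf.nn.softmax does this for us inside the graph):

```python
import numpy as np

# Two hypothetical rows of logits (made-up numbers, just for illustration).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])

# Softmax: exponentiate (shifting by the row maximum for numerical
# stability) and divide each row by its sum.
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
probs = exp / exp.sum(axis=1, keepdims=True)

print(probs.sum(axis=1))  # each row sums to 1
```

Every entry ends up between zero and one, and each row sums to one, so the rows can be read as probability distributions over the classes.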

Finally, the predicted class can be calculated from the **y_pred** matrix by taking the index of the largest element in each row.

```python
y_pred_cls = tf.argmax(y_pred, axis=1)
```

## Cost and Optimization Function

As we indicated previously, the model we have implemented for classifying and recognizing handwritten digits is a linear regression model, \(logits = xw + b\). The prediction quality of the model depends on the optimal values of the variables \(w\) (the weights tensor) and \(b\) (the bias tensor) given an input \(x\) (a tensor of images). Therefore, optimizing our classifier for the digit recognition task consists of adjusting our model so that we find the optimal values of \(w\) and \(b\). This optimization process is what is known as training, or the learning process of the model.

To improve the model’s classification of the input images, we need some method to change the values of the weights \(w\) and the biases \(b\). To do this, we first need to know how well the model actually performs, by comparing the predicted output **y_pred** of the model with the desired output **y_true**. The function that measures the error between the actual output of the system being modeled and the output of the estimator (the model) is known as the cost function. Different cost functions can be defined.

Cross-entropy is a measure of performance used in classification.

Cross-entropy is a continuous function that is always positive and if the predicted output of the model exactly matches the desired output, then the cross-entropy is equal to zero. Therefore, the objective of the optimization is to minimize the cross-entropy so that it is as close as possible to zero by changing the \(w\) weights and the \(b\) biases of the model.
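As a rough illustration (plain NumPy, with hypothetical probabilities), cross-entropy for a one-hot label reduces to the negative log of the probability the model assigns to the true class:

```python
import numpy as np

# One-hot true label and two hypothetical predicted distributions.
y_true_vec = np.array([0.0, 0.0, 1.0])
good = np.array([0.05, 0.05, 0.90])   # close to the truth
bad  = np.array([0.80, 0.15, 0.05])   # far from the truth

def cross_entropy(p_true, p_pred):
    # -sum over classes of p_true * log(p_pred)
    return -np.sum(p_true * np.log(p_pred))

print(cross_entropy(y_true_vec, good))  # small (about 0.105)
print(cross_entropy(y_true_vec, bad))   # large (about 3.0)
```

The better prediction yields a value close to zero and the worse one a much larger value, which is why minimizing cross-entropy drives the model toward the desired output.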

TensorFlow has a built-in function for calculating cross-entropy. Note that it takes the values of the \(logits\), since this TensorFlow function computes the softmax internally.

```python
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits,
                                                        labels=y_true)
```

Once we have calculated the cross-entropy for each of the image classifications, we have a measure of how well the model behaves on each image individually. But to use cross-entropy to guide the optimization of the model variables, we need a single scalar value, so we simply take the average of the cross-entropy over all the image classifications.

For this:

```python
cost = tf.reduce_mean(cross_entropy)
```

## Optimization method

Now that we have a cost measure that should be minimized, we can create an optimizer. In this case, we will use one of the most widely used methods, known as Gradient Descent (for more details, check lecture 4, lecture 5 and lecture 6), where the step size for adjusting the variables is set to 0.5.

Keep in mind that no optimization is performed at this point. In fact, nothing is calculated at all; we simply add the optimizer object to the TensorFlow graph for later execution.

```python
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)
```

## Performance measures

We need some more performance measures to show progress to the user. We create a boolean vector that checks whether the predicted class is equal to the true class of each image.

```python
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
```

To calculate the classification accuracy, we cast the booleans to floats, so that False becomes 0 and True becomes 1, and then take the average of these numbers.

```python
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```
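The same calculation can be sketched in plain NumPy with hypothetical predictions, to see why the mean of the casted booleans is the accuracy:

```python
import numpy as np

# Hypothetical predicted and true classes for six images.
pred_cls = np.array([7, 2, 1, 0, 4, 9])
true_cls = np.array([7, 2, 1, 0, 4, 1])

# tf.equal -> boolean vector; tf.cast -> floats (False=0.0, True=1.0).
correct = (pred_cls == true_cls).astype(np.float32)

# tf.reduce_mean -> the average of those floats is the accuracy.
acc_np = correct.mean()
print(acc_np)  # 5 of 6 correct
```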

## Run TensorFlow

Once all the elements of our model have been specified, we can create the graph. To execute it, we first create a session:

```python
session = tf.Session()
```

Initialize variables: The variables for weights and biases must be initialized before beginning to optimize them.

```python
session.run(tf.global_variables_initializer())
```

With 55,000 images in the training set, it can take a long time to calculate the gradient of the model using all of these images in each step of the optimization process. Therefore, we use Stochastic Gradient Descent, which only uses a small batch of randomly selected images in each iteration of the optimizer. This makes the learning process faster.
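The idea of drawing a random batch (which data.train.next_batch handles for us below) can be sketched in plain NumPy with dummy arrays (much smaller than the real 55,000 images, for brevity):

```python
import numpy as np

# Dummy training set: 1,000 flattened images and their one-hot labels.
num_train, img_size_flat, num_classes = 1000, 784, 10
train_images = np.zeros((num_train, img_size_flat))
train_labels = np.zeros((num_train, num_classes))

batch_size = 100

# Pick batch_size random row indices and slice images and labels with them.
idx = np.random.choice(num_train, size=batch_size, replace=False)
x_batch = train_images[idx]
y_true_batch = train_labels[idx]

print(x_batch.shape, y_true_batch.shape)  # (100, 784) (100, 10)
```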

We create a function that performs several optimization iterations to gradually improve the weights \(w\) and the biases \(b\) of the model. In each iteration, a new batch of data is selected from the training set, and then TensorFlow runs the optimizer using those training samples. We set the batch size to 100 images (batch_size = 100).

```python
# Batch of images.
batch_size = 100

def optimize(num_iterations):
    for i in range(num_iterations):
        # Get a batch of training examples.
        # x_batch now holds a batch of images and
        # y_true_batch are the true labels for those images.
        x_batch, y_true_batch = data.train.next_batch(batch_size)

        # Put the batch into a dict with the proper names
        # for placeholder variables in the TensorFlow graph.
        # Note that the placeholder for y_true_cls is not set
        # because it is not used during training.
        feed_dict_train = {x: x_batch, y_true: y_true_batch}

        # Run the optimizer using this batch of training data.
        # TensorFlow assigns the variables in feed_dict_train
        # to the placeholder variables and then runs the optimizer.
        session.run(optimizer, feed_dict=feed_dict_train)
```

## Helper-functions to show performance

We will create a set of functions that will help us monitor the performance of our classifier. First, we create a dictionary with the data from the test set that will be used as input to the TensorFlow graph.

```python
feed_dict_test = {x: data.test.images,
                  y_true: data.test.labels,
                  y_true_cls: data.test.cls}
```

Function to print the classification accuracy on the test set:

```python
def print_accuracy():
    # Use TensorFlow to compute the accuracy.
    acc = session.run(accuracy, feed_dict=feed_dict_test)

    # Print the accuracy.
    print("Accuracy on test-set: {0:.1%}".format(acc))
```

Function to print and plot the confusion matrix using scikit-learn:

```python
def print_confusion_matrix():
    # Get the true classifications for the test-set.
    cls_true = data.test.cls

    # Get the predicted classifications for the test-set.
    cls_pred = session.run(y_pred_cls, feed_dict=feed_dict_test)

    # Get the confusion matrix using sklearn.
    cm = confusion_matrix(y_true=cls_true, y_pred=cls_pred)

    # Print the confusion matrix as text.
    print(cm)

    # Plot the confusion matrix as an image.
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)

    # Make various adjustments to the plot.
    plt.tight_layout()
    plt.colorbar()
    tick_marks = np.arange(num_classes)
    plt.xticks(tick_marks, range(num_classes))
    plt.yticks(tick_marks, range(num_classes))
    plt.xlabel('Predicted')
    plt.ylabel('True')
```

Function to plot the weights of the model. Ten images are drawn, one for each digit the model is trained to recognize.

```python
def plot_weights():
    # Get the values for the weights from the TensorFlow variable.
    wi = session.run(w)

    # Get the lowest and highest values for the weights.
    # This is used to correct the colour intensity across
    # the images so they can be compared with each other.
    w_min = np.min(wi)
    w_max = np.max(wi)

    # Create figure with 3x4 sub-plots,
    # where the last 2 sub-plots are unused.
    fig, axes = plt.subplots(3, 4)
    fig.subplots_adjust(hspace=0.3, wspace=0.3)

    for i, ax in enumerate(axes.flat):
        # Only use the weights for the first 10 sub-plots.
        if i < 10:
            # Get the weights for the i'th digit and reshape it.
            # Note that wi.shape == (img_size_flat, 10).
            image = wi[:, i].reshape(img_shape)

            # Set the label for the sub-plot.
            ax.set_xlabel("Weights: {0}".format(i))

            # Plot the image.
            ax.imshow(image, vmin=w_min, vmax=w_max, cmap='seismic')

        # Remove ticks from each sub-plot.
        ax.set_xticks([])
        ax.set_yticks([])
```

## Performance before any optimization

Now that we have everything we need, we will run the classifier and do some performance tests.

Since we have just initialized the variables, the first thing we will look at is the accuracy of the model before performing any optimization.

Let’s execute the accuracy function we created:

```python
print_accuracy()
```

```
Accuracy on test-set: 9.8%
```

The accuracy on the test set is 9.8%. This is because the model has only been initialized and has not been optimized at all.

**Performance after 1 iteration of optimization**

Let’s run the optimization function we created for a single iteration:

```python
optimize(num_iterations=1)
print_accuracy()
```

Result:

```
Accuracy on test-set: 40.9%
```

As we can see, after a single iteration of optimization the model has increased its accuracy on the test set to 40.9%. This means that it misclassifies the images roughly 6 out of 10 times.

The weights tensor \(w\) can also be plotted, as shown below. The positive weights take on red tones and the negative weights take on blue tones. These weights can be understood intuitively as image filters.

Let’s use the plot_weights() function:

```python
plot_weights()
```

For example, the weights used to determine whether an image shows the digit zero react positively (red) to the image of a circle and negatively (blue) to images with content in the center of the circle.

Similarly, the weights used to determine whether an image shows the digit 1 react positively (red) to a vertical line in the center of the image, and negatively (blue) to images with content surrounding that line.

In these images, the weights mostly resemble the digits they are supposed to recognize. This is because only one optimization iteration has been performed, so the weights have only been trained on 100 images. After training on thousands of images, the weights become harder to interpret, because they have to recognize the many variations in how the digits can be written.

**Performance after 10 iterations**

```python
# We have already performed 1 iteration.
optimize(num_iterations=9)
print_accuracy()
```

Result:

```
Accuracy on test-set: 78.2%
```

**The weights**

```python
plot_weights()
```

**Performance after 1000 iterations of optimization**

```python
# We have already performed 10 iterations.
optimize(num_iterations=990)
print_accuracy()
plot_weights()
```

We will get:

```
Accuracy on test-set: 92.1%
```

Weights:

After 1000 iterations of optimization, the model misclassifies only about one in every ten images. This simple model cannot achieve much better performance; therefore, more complex models are needed. In subsequent tutorials, we will create a more complex model using neural networks that will help us improve the performance of our classifier.

Finally, to get a global view of the errors made by our classifier, we will analyze the confusion matrix, using our new print_confusion_matrix() function.

```python
print_confusion_matrix()
```

```
[[ 961    0    0    3    0    7    3    4    2    0]
 [   0 1097    2    4    0    2    4    2   24    0]
 [  12    8  898   23    5    4   12   12   49    9]
 [   2    0   10  927    0   28    2    9   24    8]
 [   2    1    2    2  895    0   13    4   10   53]
 [  10    1    1   37    6  774   17    4   36    6]
 [  13    3    4    2    8   19  902    3    4    0]
 [   3    6   21   12    5    1    0  936    2   42]
 [   6    3    6   18    8   26   10    5  883    9]
 [  11    5    0    7   17   11    0   16    9  933]]
```

Now that we have finished using TensorFlow, we close the session to free its resources.

```python
session.close()
```

With this, we finish the first tutorial on how to use the TensorFlow library to create a linear regression model.

In the next installments, we will create more complex models, including convolutional neural networks.