Getting started with Image Recognition: Classifying Cats and Dogs

This article kicks off a new blog series on image recognition and classification with Convolutional Neural Networks (CNNs). CNNs belong to the field of deep learning, a subarea of machine learning, and have become a cornerstone of many exciting innovations of our time. From self-driving cars and biometric security to automated tagging in social media, there are nearly endless applications. And the importance of CNNs grows steadily! So there are plenty of reasons to better understand how this technology actually works and how we can implement it.

I have divided this article into two parts: Part A introduces the core concepts behind CNNs and explains their use in image classification. Part B is a hands-on tutorial in which you will build your own CNN that can distinguish images of cats and dogs. We will work with TensorFlow and Python to integrate different types of layers, such as convolutional layers, dense layers, and max pooling. Furthermore, we will prevent the network from overfitting the training data by using dropout between the layers. You will also learn how to load the model and make predictions on a fresh set of images. The model that you will develop in this tutorial will achieve around 82% validation accuracy.

Part A) Intro to Image Classification with CNNs

The history of image recognition dates back to the mid-1960s, when the first attempts were made to identify objects by coding their characteristic shapes and lines. However, this task turned out to be incredibly complex. In fact, the human brain is so well trained to recognize objects that one can easily forget how diverse the conditions of observation can be. Here are some examples:

  • Photos can be taken from various viewpoints
  • Living things can take various forms and poses
  • Objects come in different forms, colors and sizes
  • Parts of the objects may be hidden in the picture
  • The light conditions vary from image to image
  • There may be one or multiple objects in the same image

With the beginning of the 1990s, the focus of research shifted to statistical approaches and learning algorithms.

The Emergence of CNNs

The basic concept of a neural network in computer vision has existed since the 1980s. It goes back to Hubel and Wiesel's research on the visual system of cats. They found that the visual cortex has cells that are activated by certain shapes and their orientation in the visual field. Some of their findings inspired the development of crucial computer vision technologies, such as hierarchical features with different levels of abstraction [1, 2]. However, it took another three decades of research and the availability of faster computers before the emergence of the modern CNN.

The year 2012 was a defining moment for the use of CNNs in image recognition. In this year, a CNN won the ILSVRC competition for computer vision for the first time. The challenge was to classify more than a hundred thousand images into 1000 object categories. With an error rate of only 15.3%, the winning model was a CNN called "AlexNet". It was the first model to achieve more than 75% accuracy. In the same year, CNNs succeeded in several other competitions. In 2015, the CNN ResNet even exceeded human performance in the ILSVRC competition. Only a decade ago this achievement was considered almost impossible. To understand this surge in performance, let us first look at what a picture actually is.

Top performing models in the ImageNet image classification challenge (Alyafeai & Ghouti, 2019)

What is an Image?

A digital image is a three-dimensional array of integer values. One dimension of this array represents the pixel width and one dimension represents the height of the image. The third dimension contains the color depth, which in turn is defined by the image format. As shown below, we can thus define the format of a digital image as “width x height x depth”. Next, let’s have a quick look at different image formats.

A digital image is a multidimensional integer array
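To make this more tangible, here is a minimal sketch that loads an image with PIL and inspects its array representation with NumPy. The file name cat.1.jpg is just an assumed example; note that NumPy reports the shape as height x width x depth.

import numpy as np
from PIL import Image

# load a sample image (hypothetical file name) and convert it to a NumPy array
img = Image.open("cat.1.jpg")
pixels = np.asarray(img)

print(pixels.shape)   # e.g. (375, 500, 3) -> height x width x depth
print(pixels.dtype)   # uint8 -> integer values between 0 and 255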

Image Formats

CNNs can be trained with different image formats, but the input data are always multi-dimensional arrays of integer values. One of the most commonly used color formats in deep learning is "RGB". RGB stands for the three color channels: "Red", "Green", and "Blue". RGB images are divided into three layers of integer values, one layer for each color channel. In an 8-bit RGB image, the integer values in each layer range from 0 to 255. Together, the three layers can reproduce more than 16 million different colors (256³).

In contrast to RGB images, grey-scale images only have a single color layer. This layer represents the brightness of each pixel in the image. Consequently, the format of a grey-scale image is width x height x 1. Using grey-scale images or images with black and white shades instead of RGB images can speed up the training process, because less data needs to be processed. Image data with multiple color channels, however, provide the model with more information and can therefore lead to better predictions. The RGB format is often a good compromise between prediction quality and performance. Next, let's look at how CNNs handle digital images in the learning process.
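The difference in data volume between the two formats is easy to verify in code. A short sketch, again assuming an example file cat.1.jpg:

import numpy as np
from PIL import Image

rgb = np.asarray(Image.open("cat.1.jpg"))                # hypothetical sample image, RGB
grey = np.asarray(Image.open("cat.1.jpg").convert("L"))  # the same image as a single grey-scale layer

print(rgb.shape)             # (height, width, 3) -> three color layers
print(grey.shape)            # (height, width)    -> one brightness layer
print(rgb.size / grey.size)  # 3.0 -> the RGB image holds three times as much data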

Convolutional Neural Networks

As mentioned before, a CNN is a specific form of an artificial neural network. The main differentiator between a CNN and a standard multi-layer perceptron is the convolutional layers. CNNs can have other layers as well, but what makes the CNN so good at detecting objects are really the convolutions. They allow the network to identify patterns based on features that work regardless of where in the image they occur. Let's see how this works in more detail.

Process of convolutions and filters in a CNN

Convolutional Layers

The main purpose of the convolutional layers is to extract meaningful features from the input images. To identify these features, CNNs use small grids of weights called filters, which act as feature detectors on the original image. During the training process, the CNN slides each filter over all locations of the image and calculates the dot product between the filter and the underlying image patch, one position at a time. The results of these calculations are stored in a so-called feature map (sometimes also called an activation map), which is a representation of where in the image a certain feature was identified. Subsequently, the values from the feature map are transformed with an activation function (usually ReLU) and are used as input to the next layer.
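To illustrate the idea, here is a small NumPy sketch of a single 3×3 filter sliding over a toy grey-scale image. This is a simplified illustration of the operation, not the implementation TensorFlow uses internally.

import numpy as np

def convolve2d(image, kernel):
    # slide the kernel over the image and collect the dot products (no padding, stride 1)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]
            feature_map[y, x] = np.sum(patch * kernel)  # dot product of patch and filter
    return feature_map

image = np.random.rand(8, 8)              # toy grey-scale "image"
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])    # simple vertical-edge detector

feature_map = convolve2d(image, vertical_edge)
activated = np.maximum(feature_map, 0)    # ReLU activation
print(activated.shape)                    # (6, 6)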

Features become more complex with increasing depth of the network. In the first layer of the network, convolutions detect generic low-level features based on edges, corners, squares, or circles. The subsequent layers of the network look at shapes that are more sophisticated. For example, the feature map will then contain features that resemble the eye of a cat or the nose of a dog. In the final layers, high-level features may even cover the face or the body of a cat. In this way, convolutions provide the network with features at different levels of detail that enable powerful detection patterns.

Convolutions illustrated with an image that contains the number "3"

Pooling / Downsampling

A convolutional layer is usually followed by a pooling operation, which is used to reduce the amount of data by filtering out unnecessary information. This process is also called downsampling or subsampling. There are various forms of pooling. In the most common variant – max pooling – only the highest value in a predefined grid (e.g., 2×2) is processed and the remaining values are discarded. For example, for a 2×2 grid with values 0.1, 0.5, 0.4, and 0.8, only the 0.8 would be processed further and used as part of the input to the next layer. The advantages of pooling are a reduced amount of data and therefore faster training times. Because pooling reduces the complexity of the network, it allows for the construction of deeper architectures with more layers. In addition, pooling offers a certain protection against overfitting during training.
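The example above can be reproduced in a few lines of code. The following sketch shows plain 2×2 max pooling on the values from the example with NumPy, and the same operation as a Keras layer applied to a small toy feature map.

import numpy as np
import tensorflow as tf

# the 2x2 grid from the example: max pooling keeps only the largest value
grid = np.array([[0.1, 0.5],
                 [0.4, 0.8]])
print(grid.max())   # 0.8

# the same operation with a Keras layer on a 4x4 toy feature map
x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))  # batch, height, width, channels
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
print(pooled.shape)  # (1, 2, 2, 1) -> width and height are halved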

Dropout

Dropout is another technique that helps to prevent the network from overfitting the training data. When dropout is activated for a layer, a random subset of that layer's neurons is switched off in each training step. As a result, the network needs to learn patterns that rely less on individual neurons and thus generalize better. The dropout rate controls the percentage of neurons that are switched off in each training iteration. It can be configured for each layer separately. A typical dropout rate lies in a range of 10% to 30%. CNNs with many layers and training epochs tend to overfit the training data. Especially here, the use of dropout is crucial to avoid overfitting and to achieve good prediction results on data that the network has not seen yet.
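A quick way to see dropout in action is to apply a Keras Dropout layer directly to a tensor. This is just a small sketch; note that dropout only has an effect when the layer is called with training=True.

import tensorflow as tf

tf.random.set_seed(0)
dropout = tf.keras.layers.Dropout(rate=0.2)  # switch off roughly 20% of the values per step

x = tf.ones((1, 10))
print(dropout(x, training=True).numpy())   # some values are set to 0, the rest are scaled by 1/0.8
print(dropout(x, training=False).numpy())  # unchanged: dropout is only active during training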

Multi-Layer Perceptron (MLP)

The CNN architecture is completed with multiple dense layers that are fully connected. These layers form a multi-layer perceptron (MLP), whose task is to condense the results from the previous convolutions and output one of multiple classes. Consequently, the number of neurons in the final dense layer usually corresponds to the number of different classes to be predicted. For two-class prediction problems, it is also possible to use a single neuron in the final layer, which then predicts a binary label of 0 or 1.

Part B) Building a CNN with Tensorflow that Classifies Cats and Dogs

Now that you have learned the main concepts of a CNN, let’s start with the practical part. In the following, we will train a CNN to distinguish images of cats and dogs. To do this, we first define a CNN model and then feed it a few thousand images from a public dataset with labeled images of cats and dogs. As always, you’ll find the code of this tutorial on my Github page.

Distinguishing cats and dogs may not sound like a difficult task at first, but imagine the almost infinite circumstances in which animals can be photographed, not to mention the many forms a cat can take. These variations mean that even humans will sometimes confuse a cat with a dog or vice versa. So don't expect our model to be perfect right from the start. Spoiler: our model will score around 82% accuracy on the validation dataset.

Cat or Dog? That’s what our CNN will predict

0) Prerequisites

Setup the Environment

This tutorial assumes that you have set up your Python environment. I recommend using Anaconda, but any other environment will do as well. If you don't have an environment set up yet, you can follow this tutorial.

It is also assumed that you have the following packages installed: tensorflow (2.0 or higher), pandas, scikit-learn, numpy, seaborn, and matplotlib. The packages can be installed using the console command pip install <package name> or, if you are using the Anaconda package manager, conda install <package name>. Our model will use Keras via the TensorFlow API, so no separate installation of Keras is required.

Download the Dataset

We will train our model with a public dataset from Kaggle.com. The dataset contains more than 25,000 JPG pictures of cats and dogs. The images are uniformly named and numbered, for example, dog.1.jpg, dog.2.jpg, dog.3.jpg and cat.1.jpg, cat.2.jpg, and so on. You can download the picture set directly from Kaggle: cats-vs-dogs.

Setup the Folder Structure

There are different ways in which the data can be structured and loaded during model training. One approach (1) is to split the images into classes and create a separate folder for each class: class_a, class_b, and so on. Another approach (2) is to put all images into a single folder and define a DataFrame that splits the data into train and test. Because the files in the cats and dogs dataset already contain the class in their name, I decided to go for the second approach.

Before we begin with the coding part, we create a folder structure that looks as follows:

Folder structure of our cats and dogs prediction project

If you want to use the standard paths given in this tutorial, make sure that your notebook resides in the parent folder of the "data" folder.

After you have created the folder structure, open the cats-vs-dogs zip file. The ZIP file contains the folders "train", "test", and "sample". Unzip the JPG files from the "train" folder (20,000 images) and the "test" folder (5,000 images) into the "train" folder of your project. Afterward, the train folder should contain 25,000 images. The sample folder is intended to contain your own sample images, for example of your pet. We will later use the images from the sample folder to test the model on fresh real-world data.

With this we have fulfilled all requirements and can start with the coding part.

1) Make Imports and Check Training Device

We begin by setting up the imports for this project. The decision to put the imports at the beginning is mainly because I want to give you a quick overview of the packages to be installed.

Using the GPU instead of the CPU allows for faster training times. However, setting up TensorFlow to work with the GPU can sometimes cause problems. Also, some of you may not even have a GPU at hand – in this case, TensorFlow should automatically run all code on the CPU. However, should you for any reason prefer to manually switch to CPU training, simply uncomment the line below that sets os.environ["CUDA_VISIBLE_DEVICES"] to "-1". This will force TensorFlow to run all code on the CPU and ignore all available GPUs.

import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1" 

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from tensorflow.keras.layers import Conv2D, Activation, Dropout, Flatten, Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import Accuracy
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# optionally let TensorFlow allocate GPU memory on demand instead of reserving it all upfront
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

from random import randint
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from PIL import Image
import random as rdn

Next, with the command below, we perform a quick check on the tensorflow version and the number of available GPUs in our system.

# check the tensorflow version
print('Tensorflow Version: ' + tf.__version__)

# check the number of available GPUs
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))
Tensorflow Version: 2.4.0-rc3
Num GPUs: 1

As you can see above, I am using a pre-release version of TensorFlow (2.4.0-rc3). This is because my GPU is an RTX 3080, which at the time of writing this article is not yet supported by the standard TensorFlow release. In most cases the standard release (2.3) should work just fine.

In my case, the GPU check returns 1, because I have a single GPU in my computer. If TensorFlow doesn't recognize any GPU, this command will return 0. TensorFlow will then run on the CPU.

2) Define the Prediction Classes

Next, we will define the paths to the folders that contain our train and validation images. In addition, we will define a dataframe "image_df", which lists all the pictures from the "train" folder. With the help of this dataframe we can later split the data, simply by defining which images from the train folder belong to the training dataset and which belong to the test dataset. Important note: the dataframe "image_df" only contains the names of the images along with their classes, but not the images themselves.

# set the directory for train and validation images
train_path = 'data/cats-and-dogs/train/'
test_path = 'data/cats-and-dogs/test/'

# function to create a list of image labels 
def createImageDf(path):
    filenames = os.listdir(path)
    categories = []

    for fname in filenames:
        category = fname.split('.')[0]
        if category == 'dog':
            categories.append(1)
        else:
            categories.append(0)
    df = pd.DataFrame({
        'filename':filenames,
        'category':categories
    })
    return df

# display the header of the train_df dataset
image_df = createImageDf(train_path)
image_df.head(5)

It is good practice to check the distribution of classes in the training dataset. For this purpose, we create a bar plot, which illustrates the number of images in each class. And yes, I admit, I chose some custom colors to make it look fancy.

# Plot the number of images in each class
clist = [(0, "purple"), (1, "blue")]
rvb = mcolors.LinearSegmentedColormap.from_list("", clist)
x = np.arange(0, len(["dog","cat"]))
y = image_df['category'].value_counts()
N = y.size

fig, ax = plt.subplots(figsize=(6, 2))
ax.barh(x, y, color=rvb(x/N))

That's nice: the number of images in the two classes is perfectly balanced, so we don't need to rebalance the data.

3) Plot Sample Images

I prefer not to jump directly into preprocessing and to first check that the data has been loaded correctly. We will do this by plotting some random images from the train folder. This step is not strictly necessary, but it is good practice.

n_pictures = 16 # number of pictures to be shown
columns = int(n_pictures / 2)
rows = 2
plt.figure(figsize=(40, 12))
for i in range(n_pictures):
    num = i + 1
    ax = plt.subplot(rows, columns, i + 1)
    if i < columns:
        image_name = 'cat.' + str(rdn.randint(1, 1000)) + '.jpg'
    else: 
        image_name = 'dog.' + str(rdn.randint(1, 1000)) + '.jpg'
    plt.xlabel(image_name)    
    plt.imshow(load_img(train_path + image_name)) 

#if you get a deprecated warning, you can ignore it

I never expected to have so many pictures of cats and dogs one day, but I guess neither did you 🙂 Anyway, neural networks require a fixed input shape where each neuron corresponds to a pixel value. However, as we can see from the sample images, the images in our dataset have different sizes and aspect ratios. In order for the images to fit the input shape of our neural network, we need to bring them into a common format. But before we can do that, the next step is to split the data into two datasets.

4) Split the Data

Similarly to other classification problems, image classification requires us to split the data into a train and a validation set. We define a split ratio of 1/5, so that 80% of the data goes into the train DataFrame and 20% goes into the validation DataFrame. We shuffle the data in the process, so that we end up with two DataFrames containing a random mix of cat and dog pictures. In addition, we transform the classes of the images into categorical values: 0 -> "cat" and 1 -> "dog". This gives us two new DataFrames: train_df (20,000 images) and validate_df (5,000 images).

image_df["category"] = image_df["category"].replace({0:'cat',1:'dog'})

train_df, validate_df = train_test_split(image_df, test_size=0.20, random_state=42)
train_df = train_df.reset_index(drop=True)
total_train = train_df.shape[0]

validate_df = validate_df.reset_index(drop=True)
total_validate = validate_df.shape[0]
train_df.head()

print(len(train_df), len(validate_df))
Output: 20000 5000

5) Preprocess the Images

The next step is to define two data generators for these DataFrames, which use the names given in the train and validation DataFrames to feed the images from the “train” path into our neural network. The data generator has various configuration options. We will perform the following operations:

  • Rescale the images by dividing their RGB color values (0-255) by 255
  • Shuffle the images (again)
  • Bring the images into a uniform shape of 128 x 128 pixels
  • We define a batch size of 32, so that 32 images will be processed at the same time.
  • The class mode is "binary", so that our two prediction labels are encoded as float32 scalars with values 0 or 1. This means we will only have a single output neuron in our network.
  • We perform some data augmentation techniques on the training data (incl. horizontal flip, shearing, and zoom). In this way, the model sees slightly different variants of the images during training, which helps to prevent overfitting.
Some augmentation techniques

It is important to mention that the input shape of the neural network for which the data is intended must correspond to the image shape of 128 x 128.

# set the dimensions to which we will convert the images
img_width, img_height = 128, 128
target_size = (img_width, img_height)
batch_size = 32
rescale=1.0/255

# configure the train data generator
print('Train data:')
# the augmentation settings belong to the ImageDataGenerator itself, not to flow_from_dataframe
train_datagen = ImageDataGenerator(
    rescale=rescale,
    shear_range=0.2,       # randomly shear images
    zoom_range=0.2,        # randomly zoom into images
    horizontal_flip=True)  # randomly flip images horizontally
train_generator = train_datagen.flow_from_dataframe(
    train_df, 
    train_path,
    shuffle=True, # shuffle the image data
    x_col='filename', y_col='category',
    classes=['cat', 'dog'], # cat -> 0, dog -> 1, consistent with the label mapping used later
    target_size=target_size,
    batch_size=batch_size,
    color_mode="rgb",
    class_mode='binary')

# configure test data generator
# only rescaling
print('Test data:')
validation_datagen = ImageDataGenerator(rescale=rescale)
validation_generator = validation_datagen.flow_from_dataframe(
    validate_df, 
    train_path,    
    shuffle=False, # keep the order fixed so predictions can later be matched to validate_df
    x_col='filename', y_col='category',
    classes=['cat', 'dog'], # cat -> 0, dog -> 1, consistent with the label mapping used later
    target_size=target_size,
    batch_size=batch_size,
    color_mode="rgb",
    class_mode='binary')
Train data:
Found 20000 validated image filenames belonging to 2 classes.
Test data:
Found 5000 validated image filenames belonging to 2 classes.

At this point, we have already completed the data preprocessing part. The next step is to define and compile the convolutional neural network.

6) Define and Compile the Convolutional Neural Network

In this section we will define and compile our CNN model. The way of doing this is by defining multiple layers and stacking them on top of each other. The architecture of our CNN is inspired by the famous VGGNet. However, to lower the amount of time needed to train the network, I had to reduce the number of layers.

The first layer of our network is the input layer, which receives the preprocessed images. As already noted, the shape of the input layer needs to match the shape of our images. Considering how we have defined the format of the images in our data generators, the input shape is defined as 128 x 128 x 3.

The subsequent layers are four convolutional layers, each of which is followed by a pooling layer. In addition, we define a dropout rate of 20% after each convolutional block.

Finally, a fully connected layer with 128 neurons and a single output neuron with a sigmoid activation complete the structure of the CNN.

# define the input format of the model
input_shape = (img_width, img_height, 3)
print(input_shape)

# define  model
model = Sequential()
model.add(Conv2D(32, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=input_shape))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(128, (3, 3),  strides=(1, 1),activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# compile the model and print its architecture
opt = SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

At this point we have defined and assembled our convolutional neural network. Next, it is time to train the model.

7) Train the Model

Before we start to train the network, we still have to choose the number of epochs. More epochs can improve the model performance, but also lead to longer training times. In addition, the risk increases that the model overfits. Finding the optimal number of epochs is therefore not so easy and often requires a trial-and-error approach. I typically start with a small number of 5 epochs and then increase this number until further increases do not lead to significant improvements.

# train the model
epochs = 40
early_stop = EarlyStopping(monitor='loss', patience=6, verbose=1)

history = model.fit(
    train_generator,
    epochs=epochs,
    callbacks=[early_stop],
    steps_per_epoch=len(train_generator),
    verbose=1,
    validation_data=validation_generator,
    validation_steps=len(validation_generator))

A quick comment on the time required to train the model. Although the model is not overly complex and the size of the data is still moderate, training the model can take some time. I made two training runs: one on my GPU (Nvidia GeForce RTX 3080) and one on my CPU (AMD Ryzen 3700X). On the GPU, training took approximately 10 minutes. As expected, training on the CPU was much slower and took about 30 minutes, so three times longer compared to the GPU.

After training, you may want to save the model and load it again at a later time. You can do this with the code below. Note, however, that before loading the weights, the model must be defined exactly as it was when it was trained.

# Save the weights
model.save_weights('cats-and-dogs-weights-v1.h5')

# Define model as during training
# model architecture

# Loads the weights
model.load_weights('cats-and-dogs-weights-v1.h5')
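
If you prefer not to redefine the architecture before loading, Keras can alternatively save the complete model (architecture, weights, and optimizer state) in a single file. A short sketch with an arbitrary file name:

# save the complete model (architecture + weights + optimizer state)
model.save('cats-and-dogs-model-v1.h5')

# later: restore it without redefining the layers
from tensorflow.keras.models import load_model
model = load_model('cats-and-dogs-model-v1.h5')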

8) Visualize Model Performance

So let's check how our model performs on the validation dataset. Since image classification is still classification, we can apply the same performance measures as in other classification projects. If you want to learn more about this topic, check out my previous post on Measuring Model Performance.

# plot training & validation loss values
fig, ax = plt.subplots(figsize=(15, 5), sharex=True)
plt.plot(history.history["loss"], 'b')
plt.plot(history.history["val_loss"], 'r')
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train", "Validation"], loc="upper left")
plt.grid()
plt.show()

# plot training & validation accuracy values
fig, ax = plt.subplots(figsize=(15, 5), sharex=True)
plt.plot(history.history["accuracy"], 'b')
plt.plot(history.history["val_accuracy"], 'r')
plt.title("Model accuracy")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train", "Validation"], loc="upper left")
plt.grid()
plt.show()

Next, let’s print the accuracy and a confusion matrix on the predictions from the validation dataset.

# function that returns the label for a given probability
def getLabel(prob):
    if prob > 0.5:
        return 'dog'
    else:
        return 'cat'

# get the predictions for the validation data
# (the validation generator uses shuffle=False, so the predictions line up with validate_df)
val_df = validate_df.copy()
val_df['pred'] = ""
val_pred_prob = model.predict(validation_generator)

for i in range(val_pred_prob.shape[0]):
    val_df.loc[i, 'pred'] = getLabel(val_pred_prob[i][0])
          
# create a confusion matrix
y_val = val_df['category']
y_pred = val_df['pred']

print('Accuracy: {:.2f}'.format(accuracy_score(y_val, y_pred)))
cnf_matrix = confusion_matrix(y_val, y_pred)

# plot the confusion matrix in form of a heatmap

%matplotlib inline
class_names = ['cat', 'dog'] # names of the classes (confusion_matrix orders string labels alphabetically)
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names),
            annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Accuracy: 0.82

9) Make Predictions on Sample Images

Now that we have trained our model, I bet you want to test the model on some sample data. For this purpose, ensure that you have some sample images in the “sample” folder. Then run the code below. This will predict the labels for the images from the sample folder. We then print the images in an image grid.

# set the path to the sample images
sample_path = "data/cats-and-dogs/sample/"
sample_df = createImageDf(sample_path)
sample_df['category'] = sample_df['category'].replace({0:'cat',1:'dog'})
sample_df['pred'] = ""

# create an image data generator for the sample images - we will only rescale the images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
    sample_df, 
    sample_path,    
    shuffle=False,
    x_col='filename', y_col='category',
    target_size=target_size)

# make the predictions 
pred_prob = model.predict(test_generator)
image_number = pred_prob.shape[0]

# define the plot size
nrows = 6
ncols = int(round(image_number / nrows, 0))
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=False, figsize=(15, 15))
fig.subplots_adjust(hspace=0.5, wspace=0.5)

for i in range(pred_prob.shape[0]):
    sample_df.loc[i, 'pred'] = getLabel(pred_prob[i][0])
    
print('Accuracy: {:.2f}'.format(accuracy_score(sample_df['category'], sample_df['pred'])))

# print the images
i = 0
for n in range(nrows):
    for c in range(ncols):
        if i < image_number:
            filepath = sample_path + sample_df.at[i ,'filename']
            img = Image.open(filepath).resize(target_size)
            plt.sca(ax[n, c])
            #plt.tick_params(axis = 'None')
            plt.xticks([]); plt.yticks([])
            plt.title(sample_df.at[i ,'filename'] + '\n' + ' predicted: '  + str(sample_df.at[i ,'pred']))
            plt.imshow(img)
            i += 1

Our model achieves an accuracy of around 82% on the validation set. Most images should thus be labeled correctly. With deeper architectures, more data, and more training runs, significantly better results above 95% can be achieved.

Summary

In this tutorial, you learned how to train a convolutional neural network to distinguish between dogs and cats. Provided you have enough image data, you can now use this knowledge to train models to distinguish any other objects. But there are many other cool things that can be done with CNNs, for example, object localization in images and videos. But this is a topic for another article.

I really hope you enjoyed the article and would be happy if you leave some feedback. Cheers

Sources

[1] D. H. Hubel and T. N. Wiesel – Receptive Fields of Neurons in the Cat’s Striate Cortex, The Journal of physiology (1959)
[2] C C. Aggarwal – Neural Networks and Deep Learning, Springer (2018)

