cat and dog image recognition

Are you Java Developer and eager to learn more about Deep Learning and his applications, but you are not feeling like learning another language at the moment ? Are you facing lack of the support or confusion with Machine Learning and Java?

Well you are not alone , as a Java Developer with more than 10 years of experience and several java certification I understand the obstacles and how you feel.

From my experience I know what obstacles a Java software engineering faces with the Deep Learning so I can be of a great help to you in making the journey with deep learning an exciting experience.

In this post we are going to develop a Cat&Dog Recognizer Java Application using deeplearning4j.If you would like to experiment on your own cat or dog feel free to check out the source code or download the application(fairly short instructions at the end).

Computer Vision Nature

Although with the great progress of deep learning, computer vision problems tend to be hard to solve. One of the reason is because Neural Networks(NN) are trying to learn a highly complex function like Image Recognition or Image Object Detection. We have a bunch of pixels values and from there we would like to figure out what is inside, so this really is a complex problem on his own.

Another reason why even today Computer Vision struggle is the amount of date we have. For sure the amount of data we have now is way bigger than before but still it looks like is not enough for Computer Vision problems. In particular Image Object Detection has even less data in comparison to Image Recognition(is a cat? is a dog? is a flower?) because it requires more intensive data labeling(going in each image and specifically mark each object).

Because Computer Vision is hard, traditionally it has developed complex architectures and techniques to achieve better results. We saw in previous post how adding Convolution(specialized image feature detectors) to Neural Networks greatly improved the performance in handwritten digit recognizing problem(97% to 99.5%) but in the same time introduced higher complexity ,parameters and greatly increased training time(to more than 2 hours).

Usually a NN that worked for particular image recognition problem can also work for other image related problems. So fortunately there are several ways we can approach Computer Vision problems and still be productive and have great results:

We can re-use already successfully known architectures by reducing the time needed for choosing different neural hidden layers, convolution layers other configuration parameters(learning rate).

We can re-use already trained Neural Networks(maybe someone already let NN to learn for weeks or months) by cutting the training time with great factor(transfer learning)

Play with training data by cropping, color change, rotate… to obtain more data so we can help NN learn more and be smarter.

Lets see how we can solve the problem of Detecting a Cat&Dog!

Well Known Architectures

LeNet – 5

This is a classical Neural Network architecture successfully used on handwritten digit recognizer problem back in 1998. You can find more information also for other versions of LeNet architecture here . There is an already existing implementation in deeplearning4j library in github(although not exactly as the paper).

LeNet – 5 architecture looks like below(if not familiar with convolution please have a quick look here):

In principle this architecture introduced the idea of applying several convolution and pooling layer before going to connect to a Neural Network and than to the outputs.

So it takes as input a 32x32x1(third dimension is one for black and white for RGB it will be 3) matrix than applies a 6 Convolution 5×5 matrices which by applying formula described in details here gives a 28x28x6 matrix. Notice how that third dimension is equal to the number of convolution matrices. Usually convolution will reduce first two dimension(width X height) but increase the third dimension(channels).

After that we apply a 2×2 with stride 2 Max Pooling Layer(in paper was average pool) which gives a matrix 14x14x6. Notice how Pooling Layer left the third dimension unchanged but reduced first two(width X height) by dividing with 2 so Pooling Layers are used to reduce only first two dimensions.

Additionally we apply 16 Convolution 5×5 matrices which gives a 10x10x16 and than by adding 2×2 Max Pooling we end up with 5x5x16.

We use the output 5x5x16 of several convolution and pooling to feed a 500 neural network with only one hidden layer and 10 outputs(0-9 digits). The model has to learn approx. 60.000 parameters.

According to paper this model was able to achieve a 99.05 which is impressive!

AlexNet

This is rather a more modern architecture(2012) which works on RGB colored imaged and has way more convolutions and full connected Neurons. This architecture showed great results and therefore convinced a lot of people that deep learning works pretty well for image problems. Anyway we will see that in a way this is similar to LeNet – 5 just bigger and deeper because at that time the processing power was also way greater(gpu’s were widely introduced).

There is also an already existing implementation in deeplearning4j library in github.

The architecture will look like below:

We start with more pixels and also colored images 224x224x3 RGB image. In principle is the same as LeNet – 5 above but just with more convolutions and pooling layers.Convolutions are used to increase third dimension and usually leave first two dimension unchanged(except the first one with stride s=4). Pooling layers are used to decrease(usually by dividing with two) the first two dimension(width X height) and leave the third dimension untouched. If you are wondering about Conv Same it simply means leave two first dimension (width X height) unchanged. Following formulas described on previous post is fairly easy to get same values as in picture.

After adding several convolution and pooling layers we end up with a 6x6x256 matrix which is used to feed a big Neural Network with three hidden layers respectively 9216, 4096,4096.

AlexNet is trying to detect more categories,1000 of them in comparison to LeNet – 5 which had only 10(0-9 digits) in same time it has way more parameters to learn approx. 60 million(100 times more than LeNet – 5).

VGG – 16

This architecture from 2015 beside having even more parameters is also more uniform and simple. Instead of having different sizes of Convolution and pooling layers VGG – 16 uses only one size for each of them and than just applying them several times.

There is also an already existing implementation in deeplearning4j library in github.

It always uses Convolution Same 3X3XN with stride S=1, the third dimension differs from time to time to increase/decrease the third dimension(N). Also it uses Max Pooling 2×2 stride S=2, pooling layer always have the same third dimension value as input(they play only with width and height) so we do not show the third dimension. Lets see how this architecture will look like:

Notice again how step by step height and width was decreased by adding Pooling Layers and channels(third dimension) increased by adding Convolutions. Although the model is bigger in same time is easier to read and understand thanks to the uniform way of using convolution and pooling layers.

This architecture has 138 million parameters , approx 3 times more than AlexNet(60 million) and similarly it try to detect 1000 image categories.

Other Great Architectures

There more architectures even bigger and deeper than the three above. For implementation and list of the other architecture please refer at deeplearning4j classes list on github. But just to mention a few there is also:

VGG- 19 a bigger version of VGG – 16
ResNet – 50 which uses Residual Neural Networks
GoogLeNet which uses Inception Networks. Was developed by Google and took the name in honor to the classic LeNet – 5.

Transfer Learning

One great thing about Machine Learning applications is that they are highly portable between different frameworks and even programming languages. Once you trained a neural network what you get is a bunch or parameters values(decimal values). In case of LeNet-5 60.000 parameter values, AlexNet 60 million and for VGG- 16 138 million. Than we use those parameters values to classify new coming images into one of 1000 in case of AlexNet and VGG-16 and 10 for LeNet-5.

In a few words the most valuable part of our application are the parameters. If we save parameters on disk and load them later we will get the same result as prior to saving (for same previously predicted images). Even if we save with python and load with java(or other way around) we will get the same result assuming the Neural Network implementation is correct on both of them.

Transfer learning as the name suggest it transfers already trained neural weights to others. Others can be different machines, operating systems, frameworks, languages like java,python or anything as long as you can read/save the weights values.

It maybe someone else already trained the network for really long time like weeks or months and with transfer learning we can re-use that work in a few minutes and start from there. Beside we get for free the painful tuning of hyper-parameters it is especially useful when we do not have a lot of processing power(someone else trained with thousands of GPU’s).As we will see later deeplearning4j already has the ability to save and load pretrained neural networks even from frameworks like Keras.

There are several things we can do once we load a pretrained neural network:

Directly use the network to classify or predict for already trained outputs
We can modify only the output layer from lets say 1000 to 5 and freeze everything else. So we train only from last layer to output and re-use everything else by speeding up the training time. Freezing means we do not train and not use any processing power for those layer parameters but rather use as they are.
Freeze some of the layers and add/remove other layers. Than we only train the network on new added layers and not freeze/removed layers. The reason why we freeze some of the layers is because most of the layers already are useful for our problem as well(we have similar problem to trained network) and it will take long time to train all layers(without improving much).
We use the new wights only as initial values and than train all the network including also possible new added layers. Usually this a good choice when we have a lot of processing power and more data so we are almost sure this will bring new findings and maybe better performance.

As we will see deeplearning4j supports freezing layers and adding/removing layers to a pretrained neural network.

Cat&Dog Recognizer

Data

As always every Machine Learning problem starts with the data. The amount and quality of data are very crucial for the performance of system and most of the time it requires great deal of effort and resources. So we need to rely on online public data sets as a start and than try to augment or transform existing images to create a larger variety.

For cat&dog recognizer problem fortunately we have a good data set provided by Microsoft. Also the same data set can be found on Kaggle. Originally this is a Dog & Cat data set with 12.500 cat photos and 12.500 dog photos and with 12.500 dog&cat as a test data set.

Chosen Architecture

Since 2010, ImageNet has hosted an annual challenge where research teams present solutions to image classification and other tasks by training on the ImageNet dataset. ImageNet currently has millions of labeled images; it’s one of the largest high-quality image datasets in the world. The Visual Geometry group at the University of Oxford did really well in 2014 with: VGG-16 and VGG-19(results). We will choose VGG-16 trained with ImageNet for our cat problem because it is similar to what we want to predict. VGG-16 with ImageNet already is trained to detect different races of cats&dogs, please find here the list(search with ‘cat’,’dog’).

The size of all trained weights and the model is about 500MB so if you are going to use the code to train it may take few moments to download the pretrained weights first. The code in deeplearning4j for downloading VGG-16 trained with ImageNet looks like below:

ZooModel zooModel = new VGG16();
ComputationGraph pretrainedNet = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);

Architecture Adaption

VGG-16 predicts 1000 classes of images while we need only two;if the image is cat or a dog. So we need to slightly modify the model to output only two classes instead of 1000. Everything else we leave it as it is since our problem is similar to what VGG-16 is already trained for(freeze all other layers). Modified VGG-16 will look like below:

In gray is marked the part which is freeze so we do not use any processing power to trained but instead use the weights initial downloaded values.In green is the part we trained so we are going to train only 8192 parameters (from 138 million)from the last layer to the two outputs. The code will look like below:

FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
        .learningRate(5e-5)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(Updater.NESTEROVS)
        .seed(seed)
        .build();

ComputationGraph vgg16Transfer = new TransferLearning.GraphBuilder(preTrainedNet)
        .fineTuneConfiguration(fineTuneConf)
        .setFeatureExtractor(featurizeExtractionLayer)
        .removeVertexKeepConnections("predictions")
        .addLayer("predictions",
                new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nIn(4096).nOut(NUM_POSSIBLE_LABELS)//2
                        .weightInit(WeightInit.XAVIER)
                        .activation(Activation.SOFTMAX).build(), featurizeExtractionLayer)
        .build();

The method which freezes the weights is setFeatureExtractor.From java doc of deeplearning4j :

/**
 * Specify a layer vertex to set as a "feature extractor"
 * The specified layer vertex and the layers on the path from an input vertex to it it will be "frozen" with parameters staying constant
 * @param layerName
 * @return Builder
 */
public GraphBuilder setFeatureExtractor(String... layerName) {
    this.hasFrozen = true;
    this.frozenOutputAt = layerName;
    return this;
}

So everything from the input to the layer name you defined will be freeze. If you are wondering what is the layer name and how to find than you can print first the model architecture like below:

ZooModel zooModel = new VGG16();
ComputationGraph pretrainedNet = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
log.info(pretrainedNet.summary());

After that you will get in console something that looks like below:

Notice that trainable parameters are equal to total parameters 138 million. In our case we are going to freeze from input to the last dense layer which ‘fc2’ so featurizeExtractionLayer variable value will be “fc2”. Please find below a view after freeze:

Notice how names are ending with frozen now and the trainable parameters changed from 138 million to 8194(8192+ 2 bias parameters).

Train and Results

Now we are ready to train the model and this results to fairly few lines of code:

DataSetIterator testIterator = getDataSetIterator(test.sample(PATH_FILTER, 1, 0)[0]);
int iEpoch = 0;
int i = 0;
while (iEpoch < EPOCH) {
    while (trainIterator.hasNext()) {
        DataSet trained = trainIterator.next();
        vgg16Transfer.fit(trained);
        if (i % SAVED_INTERVAL == 0 && i != 0) {

            ModelSerializer.writeModel(vgg16Transfer, new File(SAVING_PATH), false);
            evalOn(vgg16Transfer, devIterator, i);
        }
        i++;
    }

    trainIterator.reset();
    iEpoch++;

    evalOn(vgg16Transfer, testIterator, iEpoch);
}

We are using a batch size of 16 and 3 epochs. First while loop will be executed three times since epoch=3.Second inner while loop will be executed 1563 (25.000 cats and dogs/16).One epoch is full traversal through the data and one iteration is one forward and back propagation on the batch size(16 images in our case). So our model learns with small steps of 16 images and each time becomes smarter and smarter.

Before was common to not train Neural Networks with batches but rather feed all the data at once and have epochs with bigger values like 100,200… In modern Deep Learning Era due to the really big amount of data this way is not used anymore because is really slow. If we feed the network all the data at once than we will wait until the model iterates all the data(million of images )before making any progress with learning while with batch we have the model learning and progressing faster with small steps. There is more about batch vs no batch and is out of this post scope so we will leave for another post.

You can find the full code used for training in github. For the first time it has to download and unzip 600MB of data images to resources folder, so this may take some time for the first run.

After training on 85% of training set(25000) for few hours(3 hours) we were able to get below results(code used for evaluating):

Dev Set Accuracy

15% of Training Set Used as Dev Set

Examples labeled as cat classified by model as cat: 1833 times
Examples labeled as cat classified by model as dog: 42 times
Examples labeled as dog classified by model as cat: 31 times
Examples labeled as dog classified by model as dog: 1844 times

==========================Scores==========================
# of classes: 2
Accuracy: 0.9805
Precision: 0.9805
Recall: 0.9805
F1 Score: 0.9806
=========================================================

Test set Accuracy

1246 Cats and 1009 Dogs

Examples labeled as cat classified by model as cat: 934 times
Examples labeled as cat classified by model as dog: 12 times
Examples labeled as dog classified by model as cat: 46 times
Examples labeled as dog classified by model as dog: 900 times

==========================Scores========================================
# of classes: 2
Accuracy: 0.9693
Precision: 0.9700
Recall: 0.9693
F1 Score: 0.9688
========================================================================

Application

Application can be downloaded and executed without any knowledge of java beside JAVA has to be installed on your computer. Feel to try with your own cat.

It is possible to run the from source by simply executing the RUN class or if you do not fill to open it with IDE just run mvn clean install exec:java.

Please be aware for the first time it will download 500MB wights from dropbox so it may take some time depending on network.
To speedup training time model was trained only on Cats and Dogs images therefore the model has not seen(trained) other than cats or dogs. So to avoid predicting non cat or dog images as cat or dog the threshold was increased to 0.95). Meaning that we classify as cat or dog only when the confidence of the model is very high like 95%. In reality the threshold will be much lower like 50%(0.5) so if you are not satisfy with the prediction try to lower the threshold.
The downloadable application has the threshold set to 95%.
Training is not provided with graphical user interface since it really takes a lot of time and memory. Anyway feel free to train and experiment on you own choice by modifying this class. Please expect some time for the code to download 500MB of VG-166 ImageNet weights and training data 600MB for the first time.