Well, you are not alone. As a Java developer with more than 10 years of experience and several Java certifications, I understand the obstacles and how you feel.
From my experience I know what obstacles a Java software engineer faces with deep learning, so I can be of great help in making your journey with deep learning an exciting experience.
In this post we are going to develop a Java face recognition application using deeplearning4j. The application offers a GUI and the flexibility to register new faces, so feel free to try it with your own images. Additionally, you can check out the free open source code as part of the Packt video course Java Machine Learning for Computer Vision, together with many improvements to the applications from previous posts.
Face recognition has always been an important problem because of its sensitivity with regard to security and because it is closely related to people's identity. For many years face recognition applications were best known in criminology and in the search for wanted persons using cameras and sometimes even satellites. Nowadays, in the deep learning era, face recognition is found everywhere from simple applications to unlocking your phone, with state-of-the-art accuracy.
Let's first look at the challenges related to face recognition and then see how they are solved using deep learning techniques.
In previous posts we have already seen Image Classification and Object Detection, where we were mostly concerned with finding out whether an image represents a certain class (is it a dog? is it a car?), and we also saw how to mark the classified object with a bounding box.
Now we are going one step further by uniquely identifying the objects.
So the question is not just whether the image shows a car, but whether it is specifically my car, your car or someone else's car. (For animal classification we would need to find out whether this is John's dog or Maria's dog rather than just a dog.)
Face verification is no different; the same logic is simply extended to the human face.
The question is not simply whether it is a human or not, but rather: is this the person with identification X, or the company employee with a given identification number?
Then we have the face recognition problem, where we need to do face verification for a group of people instead of just one: we want to tell whether a new person is any of the persons in a certain group.
Although face recognition and verification can be thought of as the same problem, we treat them differently because face recognition can be much harder.
For instance, let's suppose we achieved a face verification accuracy of 98% when verifying that a person is who they claim to be.
That may not sound bad, but if we apply that model, with its 2% error rate, to face recognition with, let's say, 16 people, it is not going to work well: the 2% error applies to each of the 16 comparisons, so the chance of at least one mistake grows to roughly 16 × 2% = 32% (strictly an upper bound; with independent errors it is about 28%).
So for face recognition to work well and reach reasonable accuracy, also taking into account the sensitive nature of the problem, we need something like 99.99% accuracy.
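To make the arithmetic above concrete, here is a minimal sketch of how a per-comparison error rate compounds over a group. The class and method names are made up for illustration, and treating the comparisons as independent is an assumption, not something the application itself relies on.

```java
/** Sketch: how a per-comparison error rate compounds over a group of people. */
class RecognitionError {

    /** Probability that at least one of n comparisons is wrong, ASSUMING independent errors. */
    static double atLeastOneError(double perComparisonError, int comparisons) {
        return 1.0 - Math.pow(1.0 - perComparisonError, comparisons);
    }

    /** Simple upper bound (n times the per-comparison error), capped at 1. */
    static double upperBound(double perComparisonError, int comparisons) {
        return Math.min(1.0, perComparisonError * comparisons);
    }

    public static void main(String[] args) {
        // 98% verification accuracy applied to a group of 16 people
        System.out.printf("98%% accuracy:    %.4f (upper bound %.2f)%n",
                atLeastOneError(0.02, 16), upperBound(0.02, 16));
        // 99.99% verification accuracy applied to the same group
        System.out.printf("99.99%% accuracy: %.4f%n", atLeastOneError(0.0001, 16));
    }
}
```

With 99.99% accuracy the compounded error for the same group stays well below one percent, which is why such a high per-comparison accuracy is the target.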
Usually with face recognition we have only one photo of each person to recognize, so the next challenge is the problem known as the "one-shot learning" problem.
Let's say that we want to recognize the employees as they come in.
Usually we have only one photo of each employee, or in the best case a few of them.
With the knowledge and applications we have seen so far, we could of course feed all these photos to a neural network and have it predict a class for each employee. As intuitive as that may sound, it will not work well, for the reasons below.
The high-level solution to the above problems is the similarity function. Instead of trying to learn to recognize specific persons' faces as classes, what if we learn a function d which measures how similar or different two images are?
d(face_1,face_2) -> degree of difference between face images
If the function returns a value smaller than a constant γ, we consider the images similar enough to show the same person; otherwise we know they show different faces.
Suppose that on the left we have the employees' faces and on the right a person arriving. Each comparison gives us a number which is large when the images are different and small when they are similar. In this case we know that the person is the third employee in our group, since that comparison gives the lowest number and it is below the threshold, e.g. γ=0.8.
Additionally, this solution scales well, since a new person joining just means one more comparison to execute. We do not need to retrain, since the neural network has learnt a generic function to distinguish faces rather than specific faces.
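The comparison described above can be sketched in plain Java as follows. This is only an illustration: the class name, the use of Euclidean distance for d, and the tiny three-value encodings are assumptions (real encodings come from the network and are much longer).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: pick the registered member closest to a new face, if close enough. */
class SimilarityMatcher {

    /** d(face_1, face_2): here assumed to be the Euclidean distance between two encodings. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("encodings must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the name with the smallest distance if it is below the threshold, otherwise null. */
    static String whoIs(Map<String, double[]> members, double[] face, double threshold) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> member : members.entrySet()) {
            double d = distance(member.getValue(), face);
            if (d < bestDistance) {
                bestDistance = d;
                best = member.getKey();
            }
        }
        return bestDistance < threshold ? best : null;
    }

    public static void main(String[] args) {
        // Hypothetical encodings, invented for this example only
        Map<String, double[]> members = new LinkedHashMap<>();
        members.put("employee-1", new double[]{0.9, 0.1, 0.0});
        members.put("employee-2", new double[]{0.0, 0.8, 0.2});
        members.put("employee-3", new double[]{0.1, 0.1, 0.9});
        double[] visitor = {0.2, 0.1, 0.8};
        System.out.println(whoIs(members, visitor, 0.8));
    }
}
```

Registering a new member is just one more `put` into the map; nothing has to be retrained.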
The similarity function is just a high-level explanation of the solution, so let's see below two ways it is implemented in practice.
We still use convolutional architectures with many convolution layers and fully connected layers, except that the last prediction layer (the softmax layer) is not used, or is cut off.
We feed the first image X_{1} to the network, then grab the activations of the last fully connected layer, F(X_{1}), and keep them in memory.
We repeat the same for the second image X_{2}, the one we want to compare (the newly arriving employee). Now we also have the encoded activations for the second image, F(X_{2}), in memory.
Notice that the network here stays the same for both images.
That's where the name Siamese comes from: we execute the same network (or two cloned networks) for both images, and in practice this happens in parallel.
Now, with each iteration (forward step and back-propagation), the neural network learns the function d as shown in the picture.
In a few words, the goal of learning is to shift the difference towards a small or a large number depending on whether the images show the same person or different persons.
Only when the encoded values are similar do we predict that the two images are the same. Recalling the previous section, this is exactly what we referred to as the similarity function: d denotes the distance between the last-layer activations of a very deep convolutional network.
Triplet loss is another great way to solve the face recognition problem, and the one we will use for our Java application. The name triplet comes from the fact that we use three images as a single training sample. As before, we use the activations of the last fully connected layer of a very deep neural network.
First we choose the base image, or anchor image, which is compared with the two other images; a forward step gives us the last-layer activations F(A).
Next we take a different image of the same person, the so-called positive image, and similarly get the last-layer activations F(P). Recalling the previous section, we want our similarity function d(A,P), the difference between the anchor and positive activations, to be as close to zero as possible, since these images represent the same person after all.
Keeping the same anchor image, we then choose an image that represents a different person, the negative image, with activations F(N). Here we want d(A,N), the difference between the anchor and negative activations, to be large, in order to emphasize that these are different faces.
Triplet loss is explained in more detail, with diagrams and a slightly more formal definition, in Java Machine Learning for Computer Vision. After some simple math steps, the combined formula for the positive and negative comparisons with the anchor image looks like below:
ε is introduced to prevent the neural network from finding weights for which the distances for the negative and positive cases are the same (so that their difference is zero and the condition is trivially satisfied). This way the neural network has to work harder to ensure a minimum gap ε between the positive and the negative case.
Iteration by iteration, the neural network tries to learn the above function in order to satisfy the equation. It pushes the positive-case difference (green equation) to lower values and the negative-case difference (red equation) to larger values, so that the positive-case difference minus the negative-case difference is at most -ε (moving ε to the other side of the equation).
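As a rough sketch of the idea, the commonly used hinge form of triplet loss, max(d(A,P) - d(A,N) + ε, 0), can be written in plain Java as below. The class name and the choice of squared Euclidean distance for d are assumptions for illustration; this is not the code used by the application.

```java
/** Sketch of the usual hinge-style triplet loss on already computed encodings. */
class TripletLoss {

    /** Squared Euclidean distance between two encodings of equal length (assumed choice of d). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * Zero when the negative is at least `margin` (the ε of the text) further from the
     * anchor than the positive; positive otherwise, so training still has work to do.
     */
    static double loss(double[] anchor, double[] positive, double[] negative, double margin) {
        double positiveDistance = squaredDistance(anchor, positive);
        double negativeDistance = squaredDistance(anchor, negative);
        return Math.max(positiveDistance - negativeDistance + margin, 0.0);
    }

    public static void main(String[] args) {
        double[] anchor = {0.0, 0.0};
        double[] positive = {0.1, 0.0};
        double[] farNegative = {1.0, 0.0};   // already well separated
        double[] nearNegative = {0.3, 0.0};  // too close to the anchor
        System.out.println(loss(anchor, positive, farNegative, 0.2));
        System.out.println(loss(anchor, positive, nearNegative, 0.2));
    }
}
```

A triplet whose negative is already far away contributes nothing to the loss, which is one reason the choice of triplets matters so much.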
Choosing triplets has a really big impact on how well and how efficiently the network learns, so we need to choose them carefully, following the guidelines below:
The code can be freely found on GitHub as part of the video course. Although it offers all the flexibility to develop or borrow existing models, deeplearning4j face recognition has some known issues and does not yet offer pre-trained weights through transfer learning.
So in order to build the Java application we need to use the weights from the existing Keras OpenFace model found in its GitHub repository.
buildBlock3a(graph);
buildBlock3b(graph);
buildBlock3c(graph);
buildBlock4a(graph);
buildBlock4e(graph);
buildBlock5a(graph);
buildBlock5b(graph);
graph.addLayer("avgpool",
                new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.AVG, new int[]{3, 3}, new int[]{1, 1})
                        .convolutionMode(ConvolutionMode.Truncate)
                        .build(),
                "inception_5b")
        .addLayer("dense",
                new DenseLayer.Builder().nIn(736).nOut(encodings)
                        .activation(Activation.IDENTITY).build(),
                "avgpool")
        .addVertex("encodings", new L2NormalizeVertex(new int[]{}, 1e-12), "dense")
        .setInputTypes(InputType.convolutional(96, 96, inputShape[0])).pretrain(true);
/* Uncomment in case of training the network, graph.setOutputs should be lossLayer then
.addLayer("lossLayer", new CenterLossOutputLayer.Builder()
        .lossFunction(LossFunctions.LossFunction.SQUARED_LOSS)
        .activation(Activation.SOFTMAX).nIn(128).nOut(numClasses).lambda(1e-4).alpha(0.9)
        .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer).build(), "embeddings")*/
graph.setOutputs("encodings");
static void loadWeights(ComputationGraph computationGraph) throws IOException {
    Layer[] layers = computationGraph.getLayers();
    for (Layer layer : layers) {
        List<double[]> all = new ArrayList<>();
        String layerName = layer.conf().getLayer().getLayerName();
        if (layerName.contains("bn")) {
            all.add(readWightsValues(BASE + layerName + "_w.csv"));
            all.add(readWightsValues(BASE + layerName + "_b.csv"));
            all.add(readWightsValues(BASE + layerName + "_m.csv"));
            all.add(readWightsValues(BASE + layerName + "_v.csv"));
            layer.setParams(mergeAll(all));
        } else if (layerName.contains("conv")) {
            all.add(readWightsValues(BASE + layerName + "_b.csv"));
            all.add(readWightsValues(BASE + layerName + "_w.csv"));
            layer.setParams(mergeAll(all));
        } else if (layerName.contains("dense")) {
            double[] w = readWightsValues(BASE + layerName + "_w.csv");
            all.add(w);
            double[] b = readWightsValues(BASE + layerName + "_b.csv");
            all.add(b);
            layer.setParams(mergeAll(all));
        }
    }
}
Basically these are the main parts of the application, apart from the Java Swing GUI and other low-level utilities, which can be freely explored in the code.
It is possible to run the application from source by simply executing the RunFaceRecognition class. After starting the application, a Java GUI will be shown as below:
You can register your own images (Register new Member button), which will then be shown as members below, and then check whether another picture of the new member (Choose Face Image) matches or not (Who Is? button).
In the future, further consolidation may be needed in the way we load the weights. Right now the model may still need some tuning, so please stay tuned, as the code will be continually improved towards state-of-the-art accuracy.
Please notice that the OpenFace model is quite small compared to real systems, so the accuracy may not be the best, but it is quite promising and clearly shows the power of the concept explored in this post.
ENJOY :)!
Neural style transfer is the process of creating a new image by mixing two images together. Let's suppose we have the two images below:
and the generated art image will look like below:
Since we like the art in the right-hand image, we would like to transfer that style to our own photos. Of course we would prefer to preserve the photos' content as much as possible and at the same time transform them according to the style of the art image. This may look like:
We need to find a way to capture the content and style features of the images so that we can mix them together in a way that makes the output pleasing to the eye.
Deep convolutional neural networks like VGG-16 already capture these features in a way, given that they are able to classify/recognize a large variety of images (millions) with quite high accuracy. We just need to look deeper into the neural layers and understand or visualize what they are doing.
A great paper already offers the insight: Zeiler and Fergus, 2013, Visualizing and Understanding Convolutional Networks. They developed quite a sophisticated way to visualize internal layers, using Deconvolutional Networks and other specific methods. Here we will focus only on the high-level intuition of what the neural layers are doing.
Let's first bring into focus the VGG-16 architecture we saw in the Cat Image Recognition Application:
While training with images, let's suppose we pick the first layer and start monitoring the activation values of some of its units/neurons (usually 9 to 12). From all the activation values, let's pick the 9 largest for each of the chosen units. For each of these 9 values we visualize the image patch that causes the activation to be maximal. In a few words: the part of the image that makes those neurons fire the largest values.
Since we are just in the first layer, the units capture only small parts of the images and rather low-level features, as below:
It looks like the 1st neuron is interested in diagonal lines, the 3rd and 4th in vertical and diagonal lines, and the 8th for sure likes the color green. It is noticeable that all of these are really small parts of images, so the layer is capturing rather low-level features.
Let's move a bit deeper and choose the 2nd layer:
The neurons of this layer start to detect more features: the second detects thin vertical lines, the 6th and 7th start capturing round shapes, and the 14th is obsessed with the color yellow.
Deeper into the 3rd layer:
This layer for sure starts to detect more interesting stuff: the 6th neuron is more activated by round shapes that look like tires, the 10th is not easy to explain but likes orange and round shapes, while the 11th even starts detecting some humans.
Even deeper… into layers 4 and 5:
So the deeper we go, the larger the part of the image the neurons are detecting, and therefore the more high-level the features they capture (the second neuron in layer 5 is really into dogs), in comparison to the lower layers, which capture rather small parts of the image.
This gives a great intuition about what deep convolutional layers are learning. Coming back to our style transfer, it also gives us the insight into how to generate art while keeping the content from two images: we just need to generate a new image which, when fed to the neural network as input, produces more or less the same activation values as the content (photo) and style (art painting) images.
One great thing about deep learning in general is that it is highly portable between applications and even between different programming languages and frameworks. The reason is simply that what a deep learning algorithm produces is just weights, which are simply decimal values that can easily be transported and imported into different environments.
For our case we are going to use the VGG-16 architecture pre-trained on ImageNet. Usually VGG-19 is used, but unfortunately it turned out to be too slow on a CPU; on a GPU it may be better. Below is the Java code:
private ComputationGraph loadModel() throws IOException {
    ZooModel zooModel = new VGG16();
    ComputationGraph vgg16 = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
    vgg16.initGradientsView();
    log.info(vgg16.summary());
    return vgg16;
}
At the beginning we have only the content image and the style image, so the combined image starts out as a rather noisy image. Loading images is a fairly easy task:
private static final DataNormalization IMAGE_PRE_PROCESSOR = new VGG16ImagePreProcessor();
private static final NativeImageLoader LOADER = new NativeImageLoader(HEIGHT, WIDTH, CHANNELS);

INDArray content = loadImage(CONTENT_FILE);
INDArray style = loadImage(STYLE_FILE);

private INDArray loadImage(String contentFile) throws IOException {
    INDArray content = LOADER.asMatrix(new ClassPathResource(contentFile).getFile());
    IMAGE_PRE_PROCESSOR.transform(content);
    return content;
}
Please note that after loading the pixels we normalize them (IMAGE_PRE_PROCESSOR) with the mean values of all the images used when training VGG-16 on the ImageNet dataset. Normalization helps to speed up training and is something that is more or less always done.
Now it is time to generate a noisy image:
private INDArray createCombinationImage() throws IOException {
    INDArray content = LOADER.asMatrix(new ClassPathResource(CONTENT_FILE).getFile());
    IMAGE_PRE_PROCESSOR.transform(content);
    INDArray combination = createCombineImageWithRandomPixels();
    combination.muli(NOISE_RATION).addi(content.muli(1.0 - NOISE_RATION));
    return combination;
}
As we can see from the code, the combined image is not pure noise; part of it is taken from the content (NOISE_RATION controls the percentage). The idea is taken from this TensorFlow implementation and is done to speed up training and therefore get good results faster. The algorithm would eventually produce more or less the same results from a pure noise image, but it would take longer and need more iterations.
As we mentioned earlier, we will use the activation values of intermediate layers of a neural network as a metric of how similar two images are. First let's get those layer activations by doing a forward pass for the content and combined images using the pre-trained VGG-16 model:
Map<String, INDArray> activationsContentMap = vgg16FineTune.feedForward(content, true);
Map<String, INDArray> activationsCombMap = vgg16FineTune.feedForward(combination, true);
Now for each image we have a map with the layer name as key and the activation values on that layer as value. We choose a deep layer (conv4_2) for our content cost function because we want to capture features that are as high-level as possible: we would like the combined (generated) image to retain the look and shape of the content. At the same time we choose only one layer, because we don't want the combined image to look exactly like the content, but rather to leave some room for the art.
Once we have the activations of the chosen layer for both the content and the combined image, it is time to compare them and see how similar they are. To measure their similarity we use their squared difference divided by the activation dimensions, as described by this paper:
F^{ij} denotes the combined image's layer activation values and P^{ij} the content image's layer activation values. Basically it is just the squared Euclidean distance between the two sets of activations in a particular layer.
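For reference, and assuming the paper referred to is Gatys et al., "A Neural Algorithm of Artistic Style", the content cost on the chosen layer can be written with the symbols used here as:

```latex
L_{content} = \frac{1}{2} \sum_{i,j} \left( F^{ij} - P^{ij} \right)^{2}
```

The sum runs over all activation values of the layer, so the cost is zero exactly when the combined image and the content image produce identical activations there.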
Ideally we want the difference to be zero; in a few words, we minimize as much as possible the difference between the images' features on that layer. In this way we transfer the features captured by that layer from the content image to the combined image.
The implementation of the cost function in Java looks like below:
public double contentLoss(INDArray combActivations, INDArray contentActivations) {
    return sumOfSquaredErrors(contentActivations, combActivations) / (4.0 * (CHANNELS) * (WIDTH) * (HEIGHT));
}

public double sumOfSquaredErrors(INDArray a, INDArray b) {
    INDArray diff = a.sub(b); // difference
    INDArray squares = Transforms.pow(diff, 2); // element-wise squaring
    return squares.sumNumber().doubleValue();
}
The only non-essential difference from the mathematical formula in the paper is that we divide by the activation dimensions rather than by 2.
The approach for the style image is quite similar to the one for the content image, in that we still use the difference between layer activation values as the similarity measure between images. However, the style cost function differs in how the activation values are processed and calculated.
Recalling the previous post on convolution layers and the cat recognition application, a typical convolution operation results in an output with several channels (3rd dimension) besides height and width (e.g. 16 X 20 X 356, w X h X c). Usually convolution shrinks width and height and increases the number of channels.
Style is defined as the correlation between the units across channels in a specific chosen layer. For example, if we have a layer with shape 12X12X32 and we pick the 10th channel, then all 12X12=144 units of the 10th channel are correlated with all 144 units of each of the other channels: 1,2,3,4,5,6,7,8,9,11,12….32.
Mathematically this is called the Gram matrix (G) and is calculated by multiplying the unit values across channels in a layer. If two channels tend to have similar values at the same positions, the Gram matrix gives a large value for that pair; if their values are completely different, it gives a small one. So the Gram matrix captures how related the different channels are to each other (similar to the intuition of correlation). From the paper it looks like below:
Here l is the chosen layer; k is an index that iterates over the channels in the layer; notice that k_{*} is not iterating, because this is the channel we compare with all the other channels; i and j refer to the unit.
The implementation in Java looks like below:
public double styleLoss(INDArray style, INDArray combination) {
    INDArray s = gramMatrix(style);
    INDArray c = gramMatrix(combination);
    int[] shape = style.shape();
    int N = shape[0];
    int M = shape[1] * shape[2];
    return sumOfSquaredErrors(s, c) / (4.0 * (N * N) * (M * M));
}

public INDArray gramMatrix(INDArray x) {
    INDArray flattened = flatten(x);
    INDArray gram = flattened.mmul(flattened.transpose());
    return gram;
}
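To see what the matrix multiplication above computes, here is a minimal plain-Java sketch of a Gram matrix on ordinary arrays. The class name and the layout `features[channel][position]` (each channel flattened into one row) are assumptions for illustration and are independent of the ND4J code.

```java
/** Sketch: Gram matrix of a layer whose channels have each been flattened into one row. */
class GramSketch {

    /**
     * features[c][p] is the activation of channel c at flattened position p.
     * Entry [a][b] of the result is the dot product of channel a with channel b,
     * i.e. the same values as flattened.mmul(flattened.transpose()).
     */
    static double[][] gram(double[][] features) {
        int channels = features.length;
        double[][] gram = new double[channels][channels];
        for (int a = 0; a < channels; a++) {
            for (int b = 0; b < channels; b++) {
                double sum = 0.0;
                for (int p = 0; p < features[a].length; p++) {
                    sum += features[a][p] * features[b][p];
                }
                gram[a][b] = sum;
            }
        }
        return gram;
    }

    public static void main(String[] args) {
        // Two channels with two positions each (made-up values)
        double[][] features = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] g = gram(features);
        for (double[] row : g) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The result is always a channels-by-channels matrix, whatever the width and height of the layer, which is why Gram matrices of style and combined images can be compared directly.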
Once we have the Gram matrix we do the same as for the content: we calculate the squared Euclidean distance, that is, the squared difference between the Gram matrices of the combined and style image activations.
G^{ij}_{l} denotes the combined image's Gram values and A^{ij}_{l} the style image's Gram values on a specific layer l.
There is one last detail about the style cost function: usually choosing more than one layer gives better results. So for the final style cost function we choose several layers (five in the code below) and add their costs together:
E is just the equation above and w^{l} denotes a weight per layer, so we control the impact or contribution of each layer. We may want lower layers to contribute less than upper layers, but still have them contribute.
Finally, in Java it looks like below:
private static final String[] STYLE_LAYERS = new String[]{
        "block1_conv1,0.5",
        "block2_conv1,1.0",
        "block3_conv1,1.5",
        "block4_conv2,3.0",
        "block5_conv1,4.0"
};

private Double allStyleLayersLoss(Map<String, INDArray> activationsStyleMap, Map<String, INDArray> activationsCombMap) {
    Double styles = 0.0;
    for (String styleLayers : STYLE_LAYERS) {
        String[] split = styleLayers.split(",");
        String styleLayerName = split[0];
        double weight = Double.parseDouble(split[1]);
        styles += styleLoss(activationsStyleMap.get(styleLayerName).dup(),
                activationsCombMap.get(styleLayerName).dup()) * weight;
    }
    return styles;
}
Total cost measures how far the combined image is from the content image's features on the selected layer and from the style image's features on the multiple selected layers. To control how much we want the combined image to look like the content or like the style, two weights are introduced, α and β:
Increasing α will cause the combined image to look more like the content, while increasing β will cause it to have more style. Usually we decrease α and increase β.
By now we have a great way to measure how similar the combined image is to the content and to the style. What we need now is to react to the comparison result and change the combined image so that next time the difference, that is, the cost function value, is lower. Step by step we change the combined image so that it becomes closer and closer to the layer features of the content and style images.
The amount of change is determined using the derivative of the total cost function. The derivative simply gives us a direction to go. We then multiply the derivative by a coefficient α (the learning rate), which defines how far we move in that direction at each step. You don't want very small values, as it will take a lot of iterations to improve, while large values can make the algorithm unstable or prevent it from ever converging (see here for more).
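The effect of that coefficient can be seen on a toy problem. The sketch below runs the same kind of update on a simple one-dimensional cost, (x - 3)^2, instead of an image; the cost function, class name and learning rates are made up for illustration only.

```java
/** Sketch: gradient descent on the toy cost (x - 3)^2 to show the role of the learning rate. */
class GradientDescentSketch {

    static double cost(double x) {
        return (x - 3.0) * (x - 3.0);
    }

    /** Derivative of the toy cost with respect to x. */
    static double derivative(double x) {
        return 2.0 * (x - 3.0);
    }

    /** Repeatedly steps against the derivative, scaled by the learning rate. */
    static double minimize(double start, double learningRate, int iterations) {
        double x = start;
        for (int i = 0; i < iterations; i++) {
            x -= learningRate * derivative(x);
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println("small rate, few steps: " + minimize(0.0, 0.001, 100)); // still far from 3
        System.out.println("moderate rate:         " + minimize(0.0, 0.1, 100));   // close to 3
        System.out.println("too large rate:        " + minimize(0.0, 1.1, 100));   // moves away from 3
    }
}
```

In style transfer the variable being updated is the whole combined image rather than a single number, but the update has the same shape: subtract the scaled derivative at every iteration.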
If this were TensorFlow we would be done by now, since it computes the derivative of the cost function automatically for us. deeplearning4j requires the derivative to be calculated manually (ND4J offers some autodiff, so feel free to experiment with automatic differentiation) and is not designed to work in the way style transfer requires, but it has all the flexibility and pieces needed to build the algorithm.
Thanks to Jacob Schrum we were able to build the derivative implementation in Java; please find the details in the deeplearning4j examples on GitHub, in the class implementation that originally started in the MM-NEAT repository.
The last step is to update the combined image with the derivative value (multiplied by α as well):
AdamUpdater adamUpdater = createADAMUpdater();
for (int iteration = 0; iteration < ITERATIONS; iteration++) {
    log.info("iteration " + iteration);
    Map<String, INDArray> activationsCombMap = vgg16FineTune.feedForward(combination, true);
    INDArray styleBackProb = backPropagateStyles(vgg16FineTune, activationsStyleGramMap, activationsCombMap);
    INDArray backPropContent = backPropagateContent(vgg16FineTune, activationsContentMap, activationsCombMap);
    INDArray backPropAllValues = backPropContent.muli(ALPHA).addi(styleBackProb.muli(BETA));
    adamUpdater.applyUpdater(backPropAllValues, iteration);
    combination.subi(backPropAllValues);
    log.info("Total Loss: " + totalLoss(activationsStyleMap, activationsCombMap, activationsContentMap));
    if (iteration % SAVE_IMAGE_CHECKPOINT == 0) {
        //save image can be found at target/classes/styletransfer/out
        saveImage(combination.dup(), iteration);
    }
}
So each iteration we simply subtract the derivative value from the combined image's pixels, and the cost function steers us, iteration by iteration, closer to the image we want. To make the algorithm effective we update using ADAM, which simply helps gradient descent converge more stably. A simpler updater will basically work fine as well, but it will take slightly more time.
What we have described so far is gradient descent, more specifically stochastic gradient descent, since we are updating only one sample at a time. Usually L-BFGS is used for style transfer, but with deeplearning4j that would be harder and I didn't have insight into how to approach it.
Originally the showcase was implemented in MM-NEAT together with Jacob Schrum, but it was later contributed to the deeplearning4j-examples project, so feel free to download it from either source (the dl4j version is slightly refactored).
Basically the class code can easily be copied and run in a different project, as it has no dependencies besides deeplearning4j, of course.
Usually, to get decent results you need to run at least 500 iterations, but 1000 is more often recommended, while 5000 iterations produce really high-quality images. Expect to let the algorithm run for a couple of hours (3-4) for 1000 iterations.
There are a few parameters we can play with in order to make the combined image look best to us:
Please find some showcases below. Feel free to share more, since many more interesting art mixtures exist out there.
1.
2.
3.
In this post we are going to build a Java real-time video object detection application for car detection, a key component of autonomous driving systems. In the previous post we built an image classifier (cats & dogs); now we are also going to detect the objects (cars, pedestrians) and additionally mark them with bounding boxes (rectangles). Feel free to download the code or run the application with your own videos (short live example).
First we have the problem of Object Classification, in which we have an image and want to know whether it contains any of a set of particular categories, e.g.: image contains a car vs. image doesn't contain any car.
We saw how to build an image classifier in the previous post, using an existing architecture like VGG-16 and transfer learning.
Now that we can say with high confidence whether an image contains a particular object, the challenge arises of localizing the object's position in the image. Usually this is done by marking the object with a rectangle, or bounding box.
Besides classifying the image, we additionally need to specify the position of the object in the image. This is done by defining a bounding box.
A bounding box is usually represented by its center (b^{x}, b^{y}), rectangle height (b^{h}) and rectangle width (b^{w}).
Now we need to define these four variables in our training data for each of the objects in the images. Then the network will output not only the class probabilities (20% cat[1], 70% dog[2], 10% tiger[3]) but also the four variables defining the bounding box of the object.
Notice that just by providing the bounding box values (center, width, height), our model is now able to predict more information and give us a more detailed view of the content.
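As a small illustration of this representation, the sketch below stores a box as center, height and width and derives the corner coordinates needed to draw the rectangle. The class and its method names are hypothetical and not taken from the application's code; the usual image convention of y growing downwards is assumed.

```java
/** Sketch: a bounding box stored as center, height and width, with derived corners. */
class BoundingBox {
    final double centerX; // b^x
    final double centerY; // b^y
    final double height;  // b^h
    final double width;   // b^w

    BoundingBox(double centerX, double centerY, double height, double width) {
        this.centerX = centerX;
        this.centerY = centerY;
        this.height = height;
        this.width = width;
    }

    double left()   { return centerX - width / 2.0; }
    double right()  { return centerX + width / 2.0; }
    double top()    { return centerY - height / 2.0; } // assumes y grows downwards
    double bottom() { return centerY + height / 2.0; }

    public static void main(String[] args) {
        BoundingBox box = new BoundingBox(50, 40, 20, 30);
        System.out.println("top-left:     (" + box.left() + ", " + box.top() + ")");
        System.out.println("bottom-right: (" + box.right() + ", " + box.bottom() + ")");
    }
}
```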
It is not hard to imagine that adding more points in the image can give us even greater insight into it. For example, adding the positions of more points on the human face (mouth, eyes) can tell us whether the person is smiling, crying, angry or happy.
We can go a step further by localizing not just one object but multiple objects in the image, which leads us to the Object Detection problem.
Although the structure does not change much, the problem becomes harder here because we need more data preparation (multiple bounding boxes per image).
In principle this problem is solved by dividing the image into smaller rectangles; for each rectangle we have the same five additional variables we already saw, P^{c}, (b^{x}, b^{y}), b^{h}, b^{w}, and of course the normal prediction probabilities (20% cat[1], 70% dog[2]). We will see more details on this later, but for now let's say the structure is the same as for Object Localization, with the difference that we have multiple instances of that structure.
This is quite an intuitive solution that one could come up with oneself. The idea is that instead of using general car images, we crop as much as possible so that the image contains only the car.
Then we train a convolutional neural network, similar to VGG-16 or any other deep network, with the cropped images.
This works quite well, except that the model is now trained to detect images that contain only cars, so it will have trouble with real images, since they contain more than just a car (trees, people, signs…).
Real images are also much bigger than the cropped images and can contain many cars.
To overcome these problems we can analyse only a small part of the image and try to figure out whether that part contains a car or not. More precisely, we can scan the image with a sliding rectangular window and each time let our model predict whether there is a car in it. Let's see an example:
To summarize, we use a normal convolutional neural network (VGG-16) to train the model with cropped images of different sizes and then use these rectangle sizes to scan the images for the objects (cars). As we can see, by using rectangles of different sizes we can find cars of quite different shapes, and their positions.
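The scanning step can be sketched as a simple enumeration of window positions. This is only an illustration under assumed names and parameters (a fixed stride, windows given as x, y, width, height); in a real detector each window would be cropped and passed to the trained classifier.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: enumerate the positions of a sliding window over an image. */
class SlidingWindow {

    /** Returns every window as {x, y, width, height}; windows never cross the image border. */
    static List<int[]> windows(int imageWidth, int imageHeight,
                               int windowWidth, int windowHeight, int stride) {
        if (stride <= 0) {
            throw new IllegalArgumentException("stride must be positive");
        }
        List<int[]> result = new ArrayList<>();
        for (int y = 0; y + windowHeight <= imageHeight; y += stride) {
            for (int x = 0; x + windowWidth <= imageWidth; x += stride) {
                result.add(new int[]{x, y, windowWidth, windowHeight});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: scan the same image with two window sizes
        int[][] windowSizes = {{64, 64}, {128, 96}};
        for (int[] size : windowSizes) {
            int count = windows(608, 608, size[0], size[1], 16).size();
            System.out.println(size[0] + "x" + size[1] + " window: " + count + " model executions");
        }
    }
}
```

Counting the windows this way also makes the cost visible: every window position and every window size means one more full execution of the model.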
This algorithm is not that complex and it works. However, it has two downsides:
1. Performance: each time the window moves, the whole model has to be executed again, and many already computed values are not reused.
2. Even with different bounding box sizes we may fail to mark the object precisely with a bounding box. The model may not output very accurate bounding boxes; for example, the box may include only a part of the object. The next sections explore the YOLO (You Only Look Once) algorithm, which solves this problem for us.
We saw that Sliding Window has performance problems because it does not reuse many of the already computed values. Each time the window moves, we have to execute the whole model (millions of parameters) over all the pixels in the window to get a prediction. In reality, most of that computation can be reused by introducing convolution.
We saw in a previous post that image classification architectures, regardless of their size and configuration, end by feeding a fully connected Neural Network with some number of layers, which outputs several predictions depending on the classes.
For simplicity we will take a smaller network model as an example, but the same logic is valid for any convolutional network (VGG-16, AlexNet).
For a more detailed explanation of convolution and its intuition please have a look at one of my previous posts. This simple network takes as input a color image (RGB) of size 32 X 32 X 3. It uses a Same Convolution (this convolution leaves the first two dimensions, width X height, unchanged) 3 X 3 X 64 to get an output of 32 X 32 X 64 (notice how the third dimension, 64, is the same as that of the convolution matrix and is usually larger than in the input). It uses a Max Pooling layer to reduce the width and height while leaving the third dimension unchanged: 16 X 16 X 64. After that we feed a Fully Connected Neural Network with two hidden layers of 256 and 128 neurons. In the end we output probabilities (usually using soft-max) for 5 classes.
Now let's see how we can replace the Fully Connected Layers with Convolution Layers while leaving the mathematical effect the same (a linear function of the input 16 X 16 X 64).
So what we did is just replace the Fully Connected layers with Convolution Filters. In reality the 16 X 16 X 256 convolution filter is a 16 X 16 X 64 X 256 matrix (why? multiple filters). The third dimension, 64, is always the same as the third dimension of the input 16 X 16 X 64, so for the sake of simplicity it is referred to as 16 X 16 X 256 by omitting the 64. This means that it is actually equivalent to a Fully Connected Layer because:
out:1 X 1 X 256 = in:[16 X 16 X 64] * conv:[16 X 16 X 64 X 256]
so every element of the output 1 X 1 X 256 is a linear function of every element of the input 16 X 16 X 64.
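To make this equivalence concrete, here is a small sketch that computes the same output twice: once as a fully connected layer over the flattened input, and once as a convolution whose filters are as large as the input. Everything in it (class name, method names, the tiny dimensions, the random values, the missing bias) is an assumption made for illustration only.

```java
import java.util.Random;

final class FcAsConvolutionSketch {

    /** Fully connected layer without bias: out[k] = sum over i of flat[i] * weights[i][k]. */
    static double[] dense(double[] flat, double[][] weights) {
        double[] out = new double[weights[0].length];
        for (int i = 0; i < flat.length; i++) {
            for (int k = 0; k < out.length; k++) {
                out[k] += flat[i] * weights[i][k];
            }
        }
        return out;
    }

    /** Convolution whose filters are as large as the input, so there is one output position. */
    static double[] fullSizeConvolution(double[][][] input, double[][][][] filters) {
        double[] out = new double[filters.length];
        for (int k = 0; k < filters.length; k++) {
            for (int h = 0; h < input.length; h++) {
                for (int w = 0; w < input[h].length; w++) {
                    for (int c = 0; c < input[h][w].length; c++) {
                        out[k] += input[h][w][c] * filters[k][h][w][c];
                    }
                }
            }
        }
        return out;
    }

    /** Fills both layers with the same random numbers and returns the largest output gap. */
    static double maxDifference(int height, int width, int channels, int outputs) {
        Random random = new Random(42);
        double[][][] input = new double[height][width][channels];
        double[][][][] filters = new double[outputs][height][width][channels];
        double[] flat = new double[height * width * channels];
        double[][] weights = new double[flat.length][outputs];
        for (int h = 0; h < height; h++) {
            for (int w = 0; w < width; w++) {
                for (int c = 0; c < channels; c++) {
                    int i = (h * width + w) * channels + c; // position in the flattened input
                    input[h][w][c] = random.nextDouble();
                    flat[i] = input[h][w][c];
                    for (int k = 0; k < outputs; k++) {
                        filters[k][h][w][c] = random.nextDouble();
                        weights[i][k] = filters[k][h][w][c];
                    }
                }
            }
        }
        double[] fromDense = dense(flat, weights);
        double[] fromConvolution = fullSizeConvolution(input, filters);
        double max = 0;
        for (int k = 0; k < outputs; k++) {
            max = Math.max(max, Math.abs(fromDense[k] - fromConvolution[k]));
        }
        return max;
    }

    public static void main(String[] args) {
        // A small stand-in for the 16 X 16 X 64 input and 256 outputs used in the text.
        System.out.println("largest difference: " + maxDifference(4, 4, 3, 8));
    }
}
```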
The reason we bothered to convert the Fully Connected (FC) Layers to Convolution Layers is that this gives us more flexibility in the way the output is produced. With FC you will always have the same output size, which is the number of classes.
To see the idea behind replacing FC with convolution filters we need an input image that is bigger than the original 32 X 32 X 3. So let's take an input image of 36 X 36 X 3:
So this image has 4 more columns and 4 more rows (green, 36 X 36) than the original image (blue, 32 X 32). If we used a sliding window with Stride=2 and FC layers, we would need to move the original-image-sized window 9 times to cover everything (e.g. 3 moves are shown with black rectangles) and therefore execute the model 9 times as well.
Let's now try to apply this new, bigger matrix as input to our new model with Convolution Layers only.
Now, as we can see, the output changed from 1 X 1 X 5 to 3 X 3 X 5, whereas with FC the output would always be 1 X 1 X 5. Recall that in the case of Sliding Window we had to move the window 9 times to cover the whole image. Wait, isn't that equal to the 3 X 3 output of the convolution? Yes indeed, each of those 3 X 3 cells represents the 1 X 1 X 5 class probability prediction of one Sliding Window position! So instead of having only one 1 X 1 X 5 output, in one shot we now have 3 X 3 X 5, so all 9 combinations without needing to move and execute the sliding window 9 times.
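The 1 X 1 and 3 X 3 results can be checked with the usual output-size formula floor((n + 2p - f) / s) + 1. The sketch below only tracks width/height (the images are square) and assumes the layer sizes described above: a 3 X 3 same convolution, 2 X 2 max pooling with stride 2, and the former FC layers expressed as a 16 X 16 convolution followed by two 1 X 1 convolutions. The class and method names are invented for this example.

```java
final class ConvolutionalSlidingWindowShapes {

    /** Output side length of a convolution or pooling layer: floor((n + 2p - f) / s) + 1. */
    static int outputSize(int n, int filter, int padding, int stride) {
        return (n + 2 * padding - filter) / stride + 1;
    }

    /** Width (= height) of the final class map for a square input of side n. */
    static int classMapSize(int n) {
        int afterConv = outputSize(n, 3, 1, 1);           // 3 X 3 same convolution
        int afterPool = outputSize(afterConv, 2, 0, 2);   // 2 X 2 max pooling, stride 2
        int afterFc1 = outputSize(afterPool, 16, 0, 1);   // former FC layer as 16 X 16 convolution
        int afterFc2 = outputSize(afterFc1, 1, 0, 1);     // former FC layer as 1 X 1 convolution
        return outputSize(afterFc2, 1, 0, 1);             // class scores as 1 X 1 convolution
    }

    public static void main(String[] args) {
        System.out.println("32 X 32 input -> " + classMapSize(32) + " X " + classMapSize(32) + " X 5");
        System.out.println("36 X 36 input -> " + classMapSize(36) + " X " + classMapSize(36) + " X 5");
    }
}
```

The effective stride of 2 between neighbouring output cells comes from the pooling layer: moving one cell in the 18 X 18 pooled map corresponds to moving two pixels in the input.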
This is really a state of the art technique, as we were able to get all 9 results immediately in just one pass, without needing to execute a model with millions of parameters several times.
Although we already addressed the performance problem by introducing the convolutional sliding window, our model still may not output very accurate bounding boxes, even with several bounding box sizes. Let's see how YOLO solves that problem as well.
First we go through each image and mark the objects we want to detect. Each object is marked by a bounding box with four variables: the center of the object (b_x, b_y), the rectangle height (b_h) and the rectangle width (b_w). After that each image is split into a number of smaller rectangles (boxes), usually 13 X 13, but here for simplicity 8 X 9.
The bounding box (red) and the object can be part of several boxes (blue), so we assign the object and the bounding box only to the box owning the center of the object (yellow boxes). So we train our model with four additional variables (b_x, b_y, b_h, b_w), besides telling it that the object is a car, and assign those to the box owning the center b_x, b_y. Since the neural network is trained with this labeled data, it also predicts the values of these four variables, i.e. the bounding boxes, besides what the object is.
Instead of scanning with predefined bounding box sizes and trying to fit the object, we let the model learn how to mark objects with bounding boxes, so the bounding boxes are now flexible (they are learned). This way the bounding boxes are much more accurate and flexible.
Let's see how we can represent the output now that we have 4 additional variables (b_x, b_y, b_h, b_w) besides classes like 1-car, 2-pedestrian… In reality another variable, P_c, is also added, which simply tells whether the image has any of the objects we want to detect at all.
We need to label our training data in a specific way so that the YOLO algorithm works correctly. The YOLO V2 format requires the bounding box dimensions b_x, b_y and b_h, b_w to be relative to the original image width and height. Let's suppose we have a 300 X 400 image and the bounding box dimensions are B_width = 30, B_height = 15, B_x = 150, B_y = 80. This has to be transformed to:
B_w = 30/300, B_h = 15/400, B_x = 150/300, B_y = 80/400
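As a quick sketch, the same transformation in Java (the class and method names are invented for illustration, and the returned order {x, y, width, height} is just a choice made here):

```java
final class YoloLabelSketch {

    /** Returns {b_x, b_y, b_w, b_h}, each divided by the matching image dimension. */
    static double[] normalize(double imageWidth, double imageHeight,
                              double centerX, double centerY, double boxWidth, double boxHeight) {
        return new double[] {
            centerX / imageWidth,
            centerY / imageHeight,
            boxWidth / imageWidth,
            boxHeight / imageHeight
        };
    }

    public static void main(String[] args) {
        // The 300 X 400 example from the text.
        double[] label = normalize(300, 400, 150, 80, 30, 15);
        System.out.println(java.util.Arrays.toString(label));
    }
}
```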
This post shows how to label data less painfully using BBox Label Tool. The tool labels bounding boxes a bit differently from the YOLO V2 format (it gives the upper left point and the lower right point), but converting is a fairly straightforward task.
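A possible conversion from the two corner points to the relative center/size format could look like the sketch below. It assumes the corners are given in pixels with the origin at the top left of the image; the names are hypothetical and not taken from BBox Label Tool.

```java
final class CornerToYoloSketch {

    /** Converts pixel corners (x1, y1) upper left and (x2, y2) lower right to relative {b_x, b_y, b_w, b_h}. */
    static double[] toYolo(double imageWidth, double imageHeight,
                           double x1, double y1, double x2, double y2) {
        double boxWidth = x2 - x1;
        double boxHeight = y2 - y1;
        double centerX = x1 + boxWidth / 2;
        double centerY = y1 + boxHeight / 2;
        return new double[] {
            centerX / imageWidth,
            centerY / imageHeight,
            boxWidth / imageWidth,
            boxHeight / imageHeight
        };
    }

    public static void main(String[] args) {
        // Corners of a 30 X 15 box centered at (150, 80) in a 300 X 400 image.
        double[] label = toYolo(300, 400, 135, 72.5, 165, 87.5);
        System.out.println(java.util.Arrays.toString(label));
    }
}
```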
Regardless of how YOLO requires the training data to be labeled, internally the prediction is done a bit differently.
A bounding box predicted by YOLO is defined relative to the box that owns the center of the object (yellow). The upper left corner of that box is (0,0) and the bottom right is (1,1). So the center (b_x, b_y) is guaranteed to be in the range 0-1 (the sigmoid function makes sure of that), since the point is inside the box. b_h and b_w, on the other hand, are calculated in proportion to the w and h values of the box (yellow), so their values can be greater than 1 (an exponential is used to keep them positive). In the picture we can see that the width b_w of the bounding box is almost 1.8 times the box width w. Similarly, b_h is approximately 1.6 times the box height h.
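The role of the two functions can be sketched as follows. This is a simplification under stated assumptions: the raw network outputs are called t_x, t_y, t_w, t_h here (names not used in this post), and the anchor box priors that YOLO V2 additionally multiplies the sizes with are left out.

```java
final class YoloBoxDecodeSketch {

    static double sigmoid(double t) {
        return 1.0 / (1.0 + Math.exp(-t));
    }

    /**
     * raw = {t_x, t_y, t_w, t_h}; returns {b_x, b_y, b_w, b_h} relative to the grid box.
     * The sigmoid keeps the center inside the box (0..1); the exponential keeps the
     * size positive while allowing values greater than 1.
     */
    static double[] decode(double[] raw) {
        return new double[] {
            sigmoid(raw[0]),
            sigmoid(raw[1]),
            Math.exp(raw[2]),
            Math.exp(raw[3])
        };
    }

    public static void main(String[] args) {
        // Raw values chosen so the decoded size matches the 1.8 / 1.6 example in the text.
        double[] box = decode(new double[] {0.0, 0.0, Math.log(1.8), Math.log(1.6)});
        System.out.println(java.util.Arrays.toString(box));
    }
}
```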
After prediction we check how much the predicted box intersects with the real bounding box labeled at the beginning. Basically we try to maximize the intersection between them, so ideally the predicted bounding box fully overlaps the labeled bounding box.
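A common way to put a number on this overlap is intersection over union (IoU): the area shared by the two boxes divided by the area they cover together. The sketch below assumes boxes given as {x1, y1, x2, y2} corner coordinates; the class name and representation are chosen only for this example.

```java
final class IouSketch {

    static double area(double[] box) {
        return Math.max(0, box[2] - box[0]) * Math.max(0, box[3] - box[1]);
    }

    /** Boxes are {x1, y1, x2, y2} with (x1, y1) top left and (x2, y2) bottom right. */
    static double iou(double[] a, double[] b) {
        double overlapWidth = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
        double overlapHeight = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
        double intersection = overlapWidth * overlapHeight;
        double union = area(a) + area(b) - intersection;
        return union == 0 ? 0 : intersection / union;
    }

    public static void main(String[] args) {
        double[] labeled = {0, 0, 2, 2};
        double[] predicted = {1, 1, 3, 3};
        System.out.println("IoU = " + iou(labeled, predicted));
    }
}
```

A value of 1 means the predicted box matches the labeled box exactly, and 0 means they do not overlap at all.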
In principle that's it! You provide more specialized labeled data with bounding boxes (b_x, b_y, b_h, b_w), split the image and assign each object to the box containing its center (the only one responsible for detecting the object), train with the 'Convolution Sliding Window Network' and predict the object and its position.
Although we are not going to give a very detailed explanation in this post, in reality there are two more small problems to solve:
Training deep networks takes a lot of effort and requires significant data and processing power. So, as we did in a previous post, we will use transfer learning. This time we are not going to modify the architecture and train with different data, but rather use the network directly.
We are going to use Tiny YOLO, citing from the site:
Tiny YOLO is based off of the Darknet reference network and is much faster but less accurate than the normal YOLO model. To use the version trained on VOC:
wget https://pjreddie.com/media/files/tiny-yolo-voc.weights
./darknet detector test cfg/voc.data cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights data/dog.jpg
Which, ok, it’s not perfect, but boy it sure is fast. On GPU it runs at >200 FPS.
The current release version of deeplearning4j, 0.9.1, does not offer TinyYOLO, but 0.9.2-SNAPSHOT does. So first we need to tell maven to load the SNAPSHOT:
<repositories>
    <repository>
        <id>a</id>
        <url>http://repo1.maven.org/maven2/</url>
    </repository>
    <repository>
        <id>snapshots-repo</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <releases>
            <enabled>false</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>daily</updatePolicy>
        </snapshots>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${deeplearning4j}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${deeplearning4j}</version>
    </dependency>
</dependencies>
Then we are ready to load the model with fairly short code:
private TinyYoloPrediction() {
    try {
        preTrained = (ComputationGraph) new TinyYOLO().initPretrained();
        prepareLabels();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
prepareLabels just uses the labels from the PASCAL VOC dataset that was used to train the model. Feel free to run preTrained.summary() to see the model architecture details.
The video frames are captured using JavaCV with CarVideoDetection:
FFmpegFrameGrabber grabber;
grabber = new FFmpegFrameGrabber(f);
grabber.start();
while (!stop) {
    videoFrame[0] = grabber.grab();
    if (videoFrame[0] == null) {
        stop();
        break;
    }
    v[0] = new OpenCVFrameConverter.ToMat().convert(videoFrame[0]);
    if (v[0] == null) {
        continue;
    }
    if (winname == null) {
        winname = AUTONOMOUS_DRIVING_RAMOK_TECH + ThreadLocalRandom.current().nextInt();
    }
    if (thread == null) {
        // Separate thread that keeps asking the model for new bounding boxes.
        thread = new Thread(() -> {
            while (videoFrame[0] != null && !stop) {
                try {
                    TinyYoloPrediction.getINSTANCE().markWithBoundingBox(v[0], videoFrame[0].imageWidth, videoFrame[0].imageHeight, true, winname);
                } catch (java.lang.Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
        thread.start();
    }
    // The video thread only draws the boxes it already has and shows the frame.
    TinyYoloPrediction.getINSTANCE().markWithBoundingBox(v[0], videoFrame[0].imageWidth, videoFrame[0].imageHeight, false, winname);
    imshow(winname, v[0]);
}
So what the code does is get frames from the video and pass them to the pre-trained TinyYOLO model. There the image frame is first scaled to 416 X 416 X 3 (RGB) and then given to TinyYOLO for predicting and marking the bounding boxes:
public void markWithBoundingBox(Mat file, int imageWidth, int imageHeight, boolean newBoundingBOx, String winName) throws Exception {
    int width = 416;
    int height = 416;
    int gridWidth = 13;
    int gridHeight = 13;
    double detectionThreshold = 0.5;
    Yolo2OutputLayer outputLayer = (Yolo2OutputLayer) preTrained.getOutputLayer(0);
    INDArray indArray = prepareImage(file, width, height);
    INDArray results = preTrained.outputSingle(indArray);
    predictedObjects = outputLayer.getPredictedObjects(results, detectionThreshold);
    System.out.println("results = " + predictedObjects);
    markWithBoundingBox(file, gridWidth, gridHeight, imageWidth, imageHeight);
    imshow(winName, file);
}
After the prediction we should have the predicted values of the bounding box dimensions ready. We have also implemented Non-Max Suppression (removeObjectsIntersectingWithMax) because, as we mentioned, at testing time YOLO predicts more than one bounding box per object. Rather than using b_x, b_y, b_h, b_w we will use the topLeft and bottomRight points. gridWidth and gridHeight are the number of small boxes we split our image into, which in our case is 13 X 13. w and h are the original image frame dimensions.
private void markWithBoundingBox(Mat file, int gridWidth, int gridHeight, int w, int h, DetectedObject obj) {
    double[] xy1 = obj.getTopLeftXY();
    double[] xy2 = obj.getBottomRightXY();
    int predictedClass = obj.getPredictedClass();
    int x1 = (int) Math.round(w * xy1[0] / gridWidth);
    int y1 = (int) Math.round(h * xy1[1] / gridHeight);
    int x2 = (int) Math.round(w * xy2[0] / gridWidth);
    int y2 = (int) Math.round(h * xy2[1] / gridHeight);
    rectangle(file, new Point(x1, y1), new Point(x2, y2), Scalar.RED);
    putText(file, map.get(predictedClass), new Point(x1 + 2, y2 - 2), FONT_HERSHEY_DUPLEX, 1, Scalar.GREEN);
}
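The Non-Max Suppression step mentioned above can be sketched independently of the application. The code below is not the removeObjectsIntersectingWithMax method from the project; it is a generic greedy version written for this explanation, with boxes represented as {x1, y1, x2, y2, confidence} arrays and a hypothetical IoU threshold chosen by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class NonMaxSuppressionSketch {

    /** Intersection over union of two boxes given as {x1, y1, x2, y2, confidence}. */
    static double iou(double[] a, double[] b) {
        double overlapWidth = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
        double overlapHeight = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
        double intersection = overlapWidth * overlapHeight;
        double union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection;
        return union == 0 ? 0 : intersection / union;
    }

    /** Keeps the most confident box and drops every box that overlaps a kept one too much. */
    static List<double[]> suppress(List<double[]> boxes, double iouThreshold) {
        List<double[]> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((double[] box) -> box[4]).reversed());
        List<double[]> kept = new ArrayList<>();
        for (double[] candidate : sorted) {
            boolean overlapsKeptBox = false;
            for (double[] keptBox : kept) {
                if (iou(candidate, keptBox) > iouThreshold) {
                    overlapsKeptBox = true;
                    break;
                }
            }
            if (!overlapsKeptBox) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> boxes = new ArrayList<>();
        boxes.add(new double[] {1, 1, 11, 11, 0.8});   // near duplicate of the next box
        boxes.add(new double[] {0, 0, 10, 10, 0.9});
        boxes.add(new double[] {20, 20, 30, 30, 0.7}); // a different object
        System.out.println("boxes kept: " + suppress(boxes, 0.5).size());
    }
}
```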
After that, using another thread (besides the one playing the video), we update the video to show the rectangle and the label of the detected object.
The prediction is really fast (real-time) considering that it runs on a CPU; on GPUs we would have even better real-time detection.
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. Feel free to try it with your own videos.
It is possible to run it from source by simply executing the RUN class, or if you do not feel like opening it with an IDE, just run mvn clean install exec:java.
After running the application you should see the view below:
Enjoy :)!
In this post we are going to develop a Cat&Dog Recognizer Java application using deeplearning4j. If you would like to experiment with your own cat or dog, feel free to check out the source code or download the application (fairly short instructions at the end).
Despite the great progress of deep learning, computer vision problems tend to be hard to solve. One of the reasons is that Neural Networks (NN) are trying to learn a highly complex function such as Image Recognition or Image Object Detection. We have a bunch of pixel values and from there we would like to figure out what is inside, so this really is a complex problem on its own.
Another reason why Computer Vision struggles even today is the amount of data we have. The amount of data we have now is certainly much bigger than before, but it still does not seem to be enough for Computer Vision problems. In particular, Image Object Detection has even less data than Image Recognition (is it a cat? is it a dog? is it a flower?) because it requires more intensive data labeling (going through each image and specifically marking each object).
Because Computer Vision is hard, it has traditionally developed complex architectures and techniques to achieve better results. We saw in a previous post how adding Convolution (specialized image feature detectors) to Neural Networks greatly improved performance on the handwritten digit recognition problem (97% to 99.5%), but at the same time introduced higher complexity, more parameters and a greatly increased training time (to more than 2 hours).
Usually an NN that worked for a particular image recognition problem can also work for other image related problems. So fortunately there are several ways we can approach Computer Vision problems and still be productive and get great results:
We can re-use already known, successful architectures, reducing the time needed for choosing different hidden layers, convolution layers and other configuration parameters (learning rate).
We can re-use already trained Neural Networks (maybe someone already let the NN learn for weeks or months), cutting the training time by a great factor (transfer learning).
We can play with the training data by cropping, changing colors, rotating… to obtain more data, so we can help the NN learn more and be smarter.
Let's see how we can solve the problem of detecting a Cat&Dog!
This is a classical Neural Network architecture, successfully used on the handwritten digit recognition problem back in 1998. You can also find more information on other versions of the LeNet architecture here. There is an already existing implementation in the deeplearning4j library on github (although not exactly as in the paper).
The LeNet – 5 architecture looks like below (if you are not familiar with convolution please have a quick look here):
In principle this architecture introduced the idea of applying several convolution and pooling layers before connecting to a Neural Network and then to the outputs.
So it takes as input a 32x32x1 matrix (the third dimension is one for black and white; for RGB it would be 3), then applies six 5×5 convolution matrices which, by the formula described in detail here, give a 28x28x6 matrix. Notice how the third dimension is equal to the number of convolution matrices. Usually convolution will reduce the first two dimensions (width X height) but increase the third dimension (channels).
After that we apply a 2×2 Max Pooling Layer with stride 2 (in the paper it was average pooling), which gives a 14x14x6 matrix. Notice how the Pooling Layer left the third dimension unchanged but halved the first two (width X height), so Pooling Layers are used to reduce only the first two dimensions.
Additionally we apply 16 convolution 5×5 matrices, which gives 10x10x16, and then by adding 2×2 Max Pooling we end up with 5x5x16.
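The sizes above follow from the standard output-size formula floor((n + 2p - f) / s) + 1, where n is the input side, f the filter side, p the padding and s the stride. A minimal sketch that walks through the LeNet – 5 layers described here (the class and method names are made up; only width/height are tracked since the images are square):

```java
final class LeNetShapesSketch {

    /** Output side length of a convolution or pooling layer: floor((n + 2p - f) / s) + 1. */
    static int outputSize(int n, int filter, int padding, int stride) {
        return (n + 2 * padding - filter) / stride + 1;
    }

    /** Side lengths after conv 5x5, pool 2x2, conv 5x5, pool 2x2 for a square input. */
    static int[] spatialSizes(int input) {
        int conv1 = outputSize(input, 5, 0, 1); // six 5x5 filters, no padding
        int pool1 = outputSize(conv1, 2, 0, 2); // 2x2 pooling, stride 2
        int conv2 = outputSize(pool1, 5, 0, 1); // sixteen 5x5 filters, no padding
        int pool2 = outputSize(conv2, 2, 0, 2); // 2x2 pooling, stride 2
        return new int[] {conv1, pool1, conv2, pool2};
    }

    public static void main(String[] args) {
        int[] sizes = spatialSizes(32);
        int[] channels = {6, 6, 16, 16}; // pooling leaves the channel count unchanged
        for (int i = 0; i < sizes.length; i++) {
            System.out.println(sizes[i] + "x" + sizes[i] + "x" + channels[i]);
        }
    }
}
```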
We use the 5x5x16 output of the convolution and pooling layers to feed a neural network with a single hidden layer of 500 neurons and 10 outputs (digits 0-9). The model has to learn approx. 60,000 parameters.
According to the paper this model was able to achieve 99.05% accuracy, which is impressive!
This is a more modern architecture (2012) which works on RGB color images and has many more convolutions and fully connected neurons. This architecture showed great results and therefore convinced a lot of people that deep learning works pretty well for image problems. As we will see, in a way it is similar to LeNet – 5, just bigger and deeper, because by that time processing power was also far greater (GPUs had been widely introduced).
There is also an already existing implementation in the deeplearning4j library on github.
The architecture looks like below:
We start with more pixels and also with color images: a 224x224x3 RGB image. In principle it is the same as LeNet – 5 above, just with more convolution and pooling layers. Convolutions are used to increase the third dimension and usually leave the first two dimensions unchanged (except the first one, with stride s=4). Pooling layers are used to decrease the first two dimensions (width X height), usually halving them, and leave the third dimension untouched. If you are wondering about Conv Same, it simply means leaving the first two dimensions (width X height) unchanged. Following the formulas described in the previous post it is fairly easy to get the same values as in the picture.
After adding several convolution and pooling layers we end up with a 6x6x256 matrix, which is used to feed a big Neural Network with three hidden layers of 9216, 4096 and 4096 neurons respectively.
AlexNet tries to detect more categories, 1000 of them, in comparison to LeNet – 5 which had only 10 (digits 0-9). At the same time it has many more parameters to learn, approx. 60 million (1000 times more than LeNet – 5).
This architecture from 2015, besides having even more parameters, is also more uniform and simple. Instead of having different sizes of convolution and pooling layers, VGG – 16 uses only one size for each of them and just applies them several times.
There is also an already existing implementation in the deeplearning4j library on github.
It always uses Convolution Same 3X3XN with stride S=1; only the third dimension (N) changes from layer to layer. It also uses Max Pooling 2×2 with stride S=2. Pooling layers always have the same third dimension value as their input (they only change width and height), so we do not show the third dimension. Let's see what this architecture looks like:
Notice again how, step by step, height and width were decreased by adding Pooling Layers, and channels (the third dimension) were increased by adding Convolutions. Although the model is bigger, at the same time it is easier to read and understand thanks to the uniform way of using convolution and pooling layers.
This architecture has 138 million parameters, more than twice as many as AlexNet (60 million), and similarly it tries to detect 1000 image categories.
There are more architectures, even bigger and deeper than the three above. For the implementations and a list of the other architectures please refer to the deeplearning4j classes list on github. But just to mention a few, there is also:
One great thing about Machine Learning applications is that they are highly portable between different frameworks and even programming languages. Once you have trained a neural network, what you get is a bunch of parameter values (decimal values): 60,000 parameter values in the case of LeNet-5, 60 million for AlexNet and 138 million for VGG-16. Then we use those parameter values to classify new incoming images into one of 1000 classes in the case of AlexNet and VGG-16, and 10 for LeNet-5.
In a few words, the most valuable part of our application is the parameters. If we save the parameters on disk and load them later, we will get the same results as prior to saving (for the same previously predicted images). Even if we save with python and load with java (or the other way around) we will get the same results, assuming the Neural Network implementation is correct in both of them.
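The idea that the parameters are the whole model can be shown with a deliberately tiny sketch. It does not use deeplearning4j: the "network" is a single logistic unit and the file format (a length followed by raw doubles) is invented for this example. The point is only that the prediction after loading is identical to the one before saving.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class WeightsRoundTripSketch {

    static void save(double[] weights, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(weights.length);
            for (double weight : weights) {
                out.writeDouble(weight);
            }
        }
    }

    static double[] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            double[] weights = new double[in.readInt()];
            for (int i = 0; i < weights.length; i++) {
                weights[i] = in.readDouble();
            }
            return weights;
        }
    }

    /** A one-neuron "model": sigmoid of the weighted sum of the features. */
    static double predict(double[] weights, double[] features) {
        double z = 0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Saves the weights to a temporary file, loads them back and compares predictions. */
    static boolean roundTripKeepsPrediction() {
        double[] weights = {0.25, -1.5, 3.0};
        double[] features = {1.0, 2.0, 0.5};
        try {
            Path file = Files.createTempFile("weights", ".bin");
            try {
                save(weights, file);
                double[] restored = load(file);
                return predict(weights, features) == predict(restored, features);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same prediction after reload: " + roundTripKeepsPrediction());
    }
}
```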
Transfer learning, as the name suggests, transfers already trained neural weights to others. Others can be different machines, operating systems, frameworks, languages like java or python, or anything else, as long as you can read/save the weight values.
Maybe someone else has already trained the network for a really long time, weeks or months, and with transfer learning we can re-use that work in a few minutes and start from there. Besides getting the painful tuning of hyper-parameters for free, it is especially useful when we do not have a lot of processing power (someone else trained with thousands of GPUs). As we will see later, deeplearning4j already has the ability to save and load pretrained neural networks, even from frameworks like Keras.
There are several things we can do once we load a pretrained neural network:
As we will see, deeplearning4j supports freezing layers and adding/removing layers to a pretrained neural network.
As always, every Machine Learning problem starts with the data. The amount and quality of the data are crucial for the performance of the system, and most of the time collecting it requires a great deal of effort and resources. So we need to rely on online public data sets as a start and then try to augment or transform existing images to create a larger variety.
For the cat&dog recognizer problem we fortunately have a good data set provided by Microsoft. The same data set can also be found on Kaggle. Originally this is a Dog & Cat data set with 12,500 cat photos and 12,500 dog photos, and with 12,500 dog&cat photos as a test data set.
Since 2010, ImageNet has hosted an annual challenge where research teams present solutions to image classification and other tasks by training on the ImageNet dataset. ImageNet currently has millions of labeled images; it’s one of the largest high-quality image datasets in the world. The Visual Geometry group at the University of Oxford did really well in 2014 with VGG-16 and VGG-19 (results). We will choose VGG-16 trained with ImageNet for our cat problem because it is similar to what we want to predict. VGG-16 with ImageNet is already trained to detect different breeds of cats&dogs; please find the list here (search for ‘cat’, ‘dog’).
The size of all the trained weights and the model is about 500MB, so if you are going to use the code to train, it may take a few moments to download the pretrained weights first. The code in deeplearning4j for downloading VGG-16 trained with ImageNet looks like below:
ZooModel zooModel = new VGG16();
ComputationGraph pretrainedNet = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
VGG-16 predicts 1000 classes of images while we need only two: whether the image is a cat or a dog. So we need to slightly modify the model to output only two classes instead of 1000. Everything else we leave as it is (we freeze all the other layers), since our problem is similar to what VGG-16 is already trained for. The modified VGG-16 will look like below:
The part marked in gray is frozen, so we do not use any processing power to train it but instead use the initially downloaded weight values. The part in green is what we train, so we are going to train only 8192 parameters (out of 138 million), from the last layer to the two outputs. The code will look like below:
FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
        .learningRate(5e-5)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(Updater.NESTEROVS)
        .seed(seed)
        .build();
ComputationGraph vgg16Transfer = new TransferLearning.GraphBuilder(pretrainedNet)
        .fineTuneConfiguration(fineTuneConf)
        // freeze everything from the input up to and including this layer
        .setFeatureExtractor(featurizeExtractionLayer)
        // replace the 1000-class output layer with a 2-class one
        .removeVertexKeepConnections("predictions")
        .addLayer("predictions",
                new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nIn(4096).nOut(NUM_POSSIBLE_LABELS) // 2
                        .weightInit(WeightInit.XAVIER)
                        .activation(Activation.SOFTMAX).build(),
                featurizeExtractionLayer)
        .build();
The method which freezes the weights is setFeatureExtractor. From the java doc of deeplearning4j:
/**
 * Specify a layer vertex to set as a "feature extractor"
 * The specified layer vertex and the layers on the path from an input vertex to it it will be "frozen" with parameters staying constant
 * @param layerName
 * @return Builder
 */
public GraphBuilder setFeatureExtractor(String... layerName) {
    this.hasFrozen = true;
    this.frozenOutputAt = layerName;
    return this;
}
So everything from the input to the layer name you define will be frozen. If you are wondering what the layer name is and how to find it, you can first print the model architecture like below:
ZooModel zooModel = new VGG16();
ComputationGraph pretrainedNet = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
log.info(pretrainedNet.summary());
After that you will get something in the console that looks like below:
Notice that the trainable parameters are equal to the total parameters, 138 million. In our case we are going to freeze from the input to the last dense layer, which is ‘fc2’, so the value of the featurizeExtractionLayer variable will be “fc2”. Please find below a view after freezing:
Notice how the names now end with frozen and the trainable parameters changed from 138 million to 8194 (8192 + 2 bias parameters).
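The 8194 figure is simply the parameter count of a dense layer: one weight per input/output pair plus one bias per output. A tiny sketch of that arithmetic (the helper name is made up for this post):

```java
final class TrainableParamsSketch {

    /** Parameters of a dense layer: nIn * nOut weights plus nOut biases. */
    static long denseParams(long nIn, long nOut) {
        return nIn * nOut + nOut;
    }

    public static void main(String[] args) {
        System.out.println("new 2-class output layer: " + denseParams(4096, 2));
        System.out.println("original 1000-class output layer: " + denseParams(4096, 1000));
    }
}
```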
Now we are ready to train the model, which takes fairly few lines of code:
DataSetIterator testIterator = getDataSetIterator(test.sample(PATH_FILTER, 1, 0)[0]);
int iEpoch = 0;
int i = 0;
while (iEpoch < EPOCH) {
    while (trainIterator.hasNext()) {
        DataSet trained = trainIterator.next();
        vgg16Transfer.fit(trained);
        if (i % SAVED_INTERVAL == 0 && i != 0) {
            ModelSerializer.writeModel(vgg16Transfer, new File(SAVING_PATH), false);
            evalOn(vgg16Transfer, devIterator, i);
        }
        i++;
    }
    trainIterator.reset();
    iEpoch++;
    evalOn(vgg16Transfer, testIterator, iEpoch);
}
We are using a batch size of 16 and 3 epochs. The outer while loop will be executed three times since epoch=3. The inner while loop will be executed 1563 times per epoch (25,000 cat and dog images / 16, rounded up). One epoch is a full traversal through the data, and one iteration is one forward and back propagation on one batch (16 images in our case). So our model learns in small steps of 16 images and each time becomes smarter and smarter.
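The iteration count can be reproduced with a ceiling division, since the last batch of an epoch may be smaller than 16. This is just the arithmetic from the paragraph above wrapped in a hypothetical helper:

```java
final class BatchMathSketch {

    /** Number of batches needed to see every example once (the last batch may be smaller). */
    static int iterationsPerEpoch(int examples, int batchSize) {
        return (examples + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        int perEpoch = iterationsPerEpoch(25_000, 16);
        System.out.println("iterations per epoch: " + perEpoch);
        System.out.println("iterations for 3 epochs: " + 3 * perEpoch);
    }
}
```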
In the past it was common not to train Neural Networks with batches but rather to feed all the data at once and use a larger number of epochs, like 100, 200… In the modern Deep Learning era this approach is not used anymore because, with the really big amounts of data, it is really slow. If we feed the network all the data at once, we have to wait until the model has iterated over all the data (millions of images) before it makes any progress in learning, while with batches the model learns and progresses faster, in small steps. There is more to say about batch vs no batch, but it is out of this post's scope, so we will leave it for another post.
You can find the full code used for training on github. The first time, it has to download and unzip 600MB of image data to the resources folder, so the first run may take some time.
After training on 85% of the training set (25000) for a few hours (3 hours) we were able to get the results below (code used for evaluating):
15% of Training Set Used as Dev Set
Examples labeled as cat classified by model as cat: 1833 times
Examples labeled as cat classified by model as dog: 42 times
Examples labeled as dog classified by model as cat: 31 times
Examples labeled as dog classified by model as dog: 1844 times
==========================Scores==========================
# of classes: 2
Accuracy: 0.9805
Precision: 0.9805
Recall: 0.9805
F1 Score: 0.9806
=========================================================
1246 Cats and 1009 Dogs
Examples labeled as cat classified by model as cat: 934 times
Examples labeled as cat classified by model as dog: 12 times
Examples labeled as dog classified by model as cat: 46 times
Examples labeled as dog classified by model as dog: 900 times
==========================Scores========================================
# of classes: 2
Accuracy: 0.9693
Precision: 0.9700
Recall: 0.9693
F1 Score: 0.9688
========================================================================
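For readers who want to see where such scores come from, the sketch below recomputes accuracy, precision and recall from the confusion-matrix counts printed above. It is not the deeplearning4j evaluation code; it assumes precision and recall are averaged over the two classes (macro average), which happens to reproduce the reported figures to four decimals.

```java
final class ConfusionMatrixSketch {

    /** matrix[actual][predicted]; accuracy is the diagonal divided by the total. */
    static double accuracy(int[][] matrix) {
        double correct = 0;
        double total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted];
                }
            }
        }
        return correct / total;
    }

    /** Per class: correct / everything predicted as that class; then averaged over classes. */
    static double macroPrecision(int[][] matrix) {
        double sum = 0;
        for (int c = 0; c < matrix.length; c++) {
            double predictedAsC = 0;
            for (int actual = 0; actual < matrix.length; actual++) {
                predictedAsC += matrix[actual][c];
            }
            sum += matrix[c][c] / predictedAsC;
        }
        return sum / matrix.length;
    }

    /** Per class: correct / everything labeled as that class; then averaged over classes. */
    static double macroRecall(int[][] matrix) {
        double sum = 0;
        for (int c = 0; c < matrix.length; c++) {
            double labeledAsC = 0;
            for (int predicted = 0; predicted < matrix[c].length; predicted++) {
                labeledAsC += matrix[c][predicted];
            }
            sum += matrix[c][c] / labeledAsC;
        }
        return sum / matrix.length;
    }

    public static void main(String[] args) {
        // Rows: labeled cat, labeled dog. Columns: classified as cat, classified as dog.
        int[][] devSet = {{1833, 42}, {31, 1844}};
        System.out.println(String.format("accuracy=%.4f precision=%.4f recall=%.4f",
                accuracy(devSet), macroPrecision(devSet), macroRecall(devSet)));
    }
}
```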
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. Feel free to try it with your own cat.
It is possible to run it from source by simply executing the RUN class, or if you do not feel like opening it with an IDE, just run mvn clean install exec:java.
After running the application you should see the view below:
Enjoy!
In this post we are going to develop a Handwritten Digit Recognition application using Convolutional Neural Networks and Java. In the previous post a Java application was developed using simple Neural Networks, achieving an accuracy of 97% on test data. In this post we will show that Convolutional Networks are more powerful for image recognition problems and therefore produce higher accuracy. Feel free to check out the source code and experiment on your own (fairly short instructions at the end).
If you are not familiar with Neural Networks please feel free to have a quick look at the overview in the previous NN post, otherwise feel free to skip it.
So far we have seen how Neural Networks (NN) can help in solving more complex problems than simpler algorithms like Logistic Regression or even SVM. The reason behind their success is that they try to solve the problem step by step (hidden layers) rather than in one big step. So to summarize, in the previous post we ended up with the graph below for NN:
Neural Networks (NN) are without doubt great, but image detection problems are hard, so we need to enhance our NN model with more specialized image feature detectors. One of these methods is Edge Detection, citing from Wikipedia:
Edge detection includes a variety of mathematical methods that aim at identifying points in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. The points at which image brightness changes sharply are typically organized into a set of curved line segments termed edges.
So intuitively an edge is just a group of pixels where a continuous color suddenly changes for the first time. In the picture above we can see that the left part of the hair is detected thanks to the green color changing to yellow. Likewise, the flower edges are detected because the hand color changes to pink…
Wouldn’t it be great if our NN could now detect edges?
Or even better, learn to detect types of edges by itself?
Or better still, learn the types of edges that help the model predict with higher accuracy?
Images are usually represented as a matrix of pixels where each entry has three values: R (red), G (green) and B (blue) (the RGB system). More specifically, an image is a three-dimensional array image[i][j][k] where i, j represent a pixel position (the i×j size can be, for example, 1280×800 depending on the image resolution) and image[i][j][0] is the Red value, image[i][j][1] the Green value and image[i][j][2] the Blue value.
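To make the image[i][j][k] representation concrete, here is a minimal sketch that copies a java.awt.image.BufferedImage into such an array. The class and method names are hypothetical (not part of the application source), and treating i as the row and j as the column is an assumption of this sketch.

```java
import java.awt.image.BufferedImage;

final class RgbArrays {
    /** Converts an image into image[i][j][k]: i = row, j = column, k = 0 (red), 1 (green), 2 (blue). */
    static int[][][] toRgbArray(BufferedImage source) {
        int height = source.getHeight();
        int width = source.getWidth();
        int[][][] image = new int[height][width][3];
        for (int i = 0; i < height; i++) {
            for (int j = 0; j < width; j++) {
                int packed = source.getRGB(j, i);        // getRGB takes (x, y), i.e. (column, row)
                image[i][j][0] = (packed >> 16) & 0xFF;  // red
                image[i][j][1] = (packed >> 8) & 0xFF;   // green
                image[i][j][2] = packed & 0xFF;          // blue
            }
        }
        return image;
    }
}
```

Each value ends up in the range 0-255, which is exactly the kind of number matrix the rest of this post works with.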
Let's see step by step how a vertical edge can be detected, and then we will see why this works and give a high-level intuition behind it.
Let's suppose we have a simple image that is half white and half black (ignoring the contour).
Let's suppose, for the sake of better visualization, that the image resolution is really small, like 8×6, so 48 pixels. Additionally, we will suppose that the image is represented in gray scale. In contrast to RGB, where we have a three-dimensional matrix image[i][j][k] (where i, j are the pixel position and k (0, 1, 2) indexes the Red, Green and Blue color values), in gray scale we have only a two-dimensional matrix image[i][j], where i, j is still the pixel position but each entry is a single value in the range 0-255, where 0 is black and 255 is white.
Our simple image will look like below to a computer:
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
255 | 255 | 255 | 10 | 10 | 10 |
The values there are perfect white 255 (for the left side) and almost perfect black 10 (for the right side). We did not use 0 for perfect black only because a non-zero value helps to explain the topic better.
We will introduce another small 3×3 matrix called a filter (a vertical one).
1 | 0 | -1 |
1 | 0 | -1 |
1 | 0 | -1 |
Now let's introduce a new operation * called convolution as below:
Convolution is a fairly simple operation once you see it in action:
As you have probably noticed by now, we are just sliding the filter one position at a time horizontally and vertically until the end. Each time we multiply the filter's elements with the selected area (sub-matrix) of the image matrix and then add everything together. So for position (0,0) we do:
(1 * 255 + 0 * 255 + -1 * 255 )+ (1 * 255 + 0 * 255 + -1 * 255) +( 1 * 255 + 0 * 255 + -1 * 255) = 0
This new operation produces, as we can see, the new matrix below:
0 | 735 | 735 | 0 |
0 | 735 | 735 | 0 |
0 | 735 | 735 | 0 |
0 | 735 | 735 | 0 |
0 | 735 | 735 | 0 |
0 | 735 | 735 | 0 |
One of the first observations is that the new matrix has shrunk from 8×6 to 6×4 pixels (more on this later).
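The sliding, multiplying and adding described above can be sketched in a few lines of Java. This is only a minimal illustration with hypothetical names, not the EdgeDetection implementation from the source code; as in most deep learning libraries, the filter is applied as-is without being flipped.

```java
final class VerticalEdgeExample {
    /** Slides the filter over the image one position at a time (no padding) and sums the products. */
    static int[][] convolve(int[][] image, int[][] filter) {
        int outRows = image.length - filter.length + 1;
        int outCols = image[0].length - filter[0].length + 1;
        int[][] result = new int[outRows][outCols];
        for (int r = 0; r < outRows; r++) {
            for (int c = 0; c < outCols; c++) {
                int sum = 0;
                for (int fr = 0; fr < filter.length; fr++) {
                    for (int fc = 0; fc < filter[0].length; fc++) {
                        sum += image[r + fr][c + fc] * filter[fr][fc];
                    }
                }
                result[r][c] = sum;
            }
        }
        return result;
    }

    /** Builds the 8x6 sample image: three white (255) columns followed by three dark (10) columns. */
    static int[][] halfWhiteHalfBlack() {
        int[][] image = new int[8][6];
        for (int[] row : image) {
            for (int c = 0; c < 6; c++) {
                row[c] = c < 3 ? 255 : 10;
            }
        }
        return image;
    }

    public static void main(String[] args) {
        int[][] vertical = {{1, 0, -1}, {1, 0, -1}, {1, 0, -1}};
        for (int[] row : convolve(halfWhiteHalfBlack(), vertical)) {
            System.out.println(java.util.Arrays.toString(row)); // each of the 6 rows: [0, 735, 735, 0]
        }
    }
}
```

The output size follows directly from the loop bounds: 8 - 3 + 1 = 6 rows and 6 - 3 + 1 = 4 columns.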
Most importantly, the middle of the matrix has high non-zero values while the two sides are just zeros. Recalling that we wanted to detect vertical edges in a black and white image, let's see if this helps. If we transform this matrix back into an image, keeping in mind that 0 is black and 255 or greater is white, we get something like this:
The image now has a white area right in the middle, exactly where we were looking for the edge in the simple black and white image. If you are wondering why the white area indicating the edge is bigger than the sides, this is due to our small 8×6 pixel example. If we picked a wider image, like 8×20 pixels, the white area would be much smaller in comparison to the sides.
There is a simple implementation of EdgeDetection if you would like to try it on your own. Executed on a bigger black and white image of 466×291 pixels, the results look like below:
The edge now is clearly visible and also much smaller than the sides.
In the previous topic we explored the core concept behind Edge Detectors by giving an example of a Vertical Edge Detector. In this section we will similarly walk through an example for detecting Horizontal Edges.
First let's define our example image as below:
Similarly, for simplicity we will work with an 8×6 pixel image that has the matrix representation below:
255 | 255 | 255 | 255 | 255 | 255 |
255 | 255 | 255 | 255 | 255 | 255 |
255 | 255 | 255 | 255 | 255 | 255 |
255 | 255 | 255 | 255 | 255 | 255 |
10 | 10 | 10 | 10 | 10 | 10 |
10 | 10 | 10 | 10 | 10 | 10 |
10 | 10 | 10 | 10 | 10 | 10 |
10 | 10 | 10 | 10 | 10 | 10 |
Same as in the previous example, 255 represents white and 10 represents black (0 would be the darkest black, but for the sake of explanation we keep a non-zero value).
Now it is time to define the filter. This time we will use a slightly different filter, which is just the vertical filter rotated 90° to the right (the horizontal filter):
1 | 1 | 1 |
0 | 0 | 0 |
-1 | -1 | -1 |
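As a small sanity check that the horizontal filter really is the vertical one turned 90° to the right, here is a minimal sketch with a hypothetical helper (not taken from the application source):

```java
final class FilterRotation {
    /** Rotates a square filter 90 degrees clockwise: the element at (r, c) moves to (c, n - 1 - r). */
    static int[][] rotateRight(int[][] filter) {
        int n = filter.length;
        int[][] rotated = new int[n][n];
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) {
                rotated[c][n - 1 - r] = filter[r][c];
            }
        }
        return rotated;
    }

    public static void main(String[] args) {
        int[][] vertical = {{1, 0, -1}, {1, 0, -1}, {1, 0, -1}};
        // Prints [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
        System.out.println(java.util.Arrays.deepToString(rotateRight(vertical)));
    }
}
```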
If we apply the convolution as explained in the previous topic we get the matrix values below:
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
735 | 735 | 735 | 735 |
735 | 735 | 735 | 735 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
Same as before, the first observation is that the matrix dimension has shrunk from 8×6 to 6×4, and the second is that in the middle we see large non-zero values in comparison to the zeros on the sides.
Recalling that we wanted to detect horizontal edges in a black and white image, let's see if this helps. If we transform this matrix back into an image, keeping in mind that 0 is black and 255 or greater is white, we get something like this:
The image has a white area right in the middle, exactly where we were looking for the edge. The edge here is quite thick because of the small 8×6 resolution we picked; in the example below the edge will be much thinner.
There is a simple implementation of EdgeDetection if you would like to try it on your own. Executed on a bigger black and white image of 466×291 pixels, the results look like below:
What is the convolution with a small matrix doing that enables us to detect edges, and therefore obtain more specialized image features?
An edge is just a group of pixels where a continuous color changes for the first time into some other color. In our simple black and white image the edge was in the middle, where white changes to black. It is not difficult to imagine edges in more complicated images either; we just need to follow the color boundaries.
Let's look one more time at our two filters:
1 | 0 | -1 |   | 1 | 1 | 1 |
1 | 0 | -1 |   | 0 | 0 | 0 |
1 | 0 | -1 |   | -1 | -1 | -1 |
On the left we have the vertical filter and on the right the horizontal filter
As we saw, the filter slides one position at a time horizontally until it reaches the last column, then slides one position vertically and again horizontally, until it reaches the last row and column. Like below:
Each time the filter processes a sub-matrix of the same size. In this case it processes a small 3×3 sub-matrix of the big original image matrix, starting at a position that depends on where the filter is. Let's see what the filter does to this sub-matrix:
1 * 255 | 0 * 255 | -1 * 255 |   | 1 * 255 | 1 * 255 | 1 * 255 |
1 * 255 | 0 * 255 | -1 * 255 |   | 0 * 255 | 0 * 255 | 0 * 255 |
1 * 255 | 0 * 255 | -1 * 255 |   | -1 * 255 | -1 * 255 | -1 * 255 |
On the left is the vertical filter in action and on the right is the horizontal filter in action.
Mathematically, the filter just multiplies each position and then adds everything up, producing a single number. So each position of the filter produces one number.
Both filters ignore the middle (the middle column for the vertical filter, the middle row for the horizontal one) by multiplying it with zero. So they are both interested only in the sides: the vertical filter in the left and right sides, and the horizontal filter in the top and bottom sides.
Regardless of which sides are compared (left and right, or top and bottom), the filters can produce two states:
The middle was left out this time, but of course it will be included next time, when the filter slides one position to the right (vertical filter) or one position down (horizontal filter).
Basically, a filter just transforms the image matrix into another matrix where every pixel signals whether there is an edge or not. In our case, if a pixel in this new matrix is 0 there is no edge, so we fill it with black. On the other hand, if a pixel has a non-zero value we take its absolute value (-100 and 100 both give 100), capped at a maximum of 255 (-735 and 735 both give 255). This indicates the presence of an edge and is drawn closer to white, since a higher number means whiter.
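The mapping from convolution values back to drawable gray levels can be sketched as below. The names are hypothetical; the rule (absolute value, capped at 255) is the one described above.

```java
final class EdgePixels {
    /** Maps one convolution value to a gray level: absolute value, capped at 255. */
    static int toGray(int convolved) {
        return Math.min(255, Math.abs(convolved));
    }

    /** Applies toGray to every entry, giving a matrix that can be drawn as a gray scale image. */
    static int[][] toGrayImage(int[][] convolved) {
        int[][] gray = new int[convolved.length][];
        for (int r = 0; r < convolved.length; r++) {
            gray[r] = new int[convolved[r].length];
            for (int c = 0; c < convolved[r].length; c++) {
                gray[r][c] = toGray(convolved[r][c]);
            }
        }
        return gray;
    }
}
```

For our example row 0 | 735 | 735 | 0 this gives 0 | 255 | 255 | 0, that is black, white, white, black.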
So far we have only walked through simple black and white images. Now it is time to explain how convolution works with RGB color images.
Recalling from the Data Representation section, an RGB image is represented by a three-dimensional matrix image[i][j][k] where (i, j) is the position of a pixel and k (0, 1, 2) indexes the color value for Red, Green and Blue. Compared to the black and white sample, which had only one matrix with gray scale values from 0-255 (0 black, 255 white), we now have three matrices, each holding values 0-255 for Red, Green and Blue, like below:
The problem does not sound hard to address, as now we simply have three matrices instead of one. One could suggest just applying convolution to each of them separately, as we did for the single black and white matrix:
This works quite well since, like the black and white matrix, each transformed matrix has values from 0-255. They just represent R, G, B respectively instead of gray scale, but a computer does not really care as long as you feed it numbers.
Although it looks like a solution, we have one last problem to solve. As we know, after the convolution the matrix pixels tell us whether a pixel is an edge or not. So they no longer represent color information, or at least not in the same format as the original R, G, B matrices. Keeping this in mind, does it really make sense to keep all three of them?
Indeed, keeping them separate does not intuitively seem useful, and it occupies memory and processing power. Maybe it would be interesting research to try keeping all three (I am not aware of any such research at the moment). The solution again is pretty intuitive: we just add the three matrices together, so in the end there is only one matrix left.
As we can see, convolution always results in one single matrix (the number of channels will be one) regardless of the number of channels you had before applying it (usually 3 for RGB, but there can also be more).
Notice that we used the same filter for each of the input R, G, B matrices, but in reality those filters may be different. So the convolution of an RGB (W×H×3) matrix is done with a three-dimensional filter (W×H×3, where W, H can be for example 5×5, but the third dimension has to equal that of the input matrix, 3). If the input matrix has more channels, like 4, 5 or 6, then the filter must have the same number of channels as well. Regardless of the third dimension (number of channels), the convolution always produces a two-dimensional matrix by collapsing the third dimension.
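A minimal sketch of this multi-channel convolution is shown below. It assumes the image[row][col][channel] layout from the Data Representation section and a filter with the same number of channels; the names are hypothetical and this is not the EdgeDetection code from the source.

```java
final class RgbConvolution {
    /**
     * Convolves image[row][col][channel] with filter[row][col][channel] (no padding, stride 1).
     * The products of all channels are added together, so the result has a single channel.
     */
    static int[][] convolve(int[][][] image, int[][][] filter) {
        int channels = image[0][0].length;
        if (filter[0][0].length != channels) {
            throw new IllegalArgumentException("Filter depth must equal the number of image channels");
        }
        int outRows = image.length - filter.length + 1;
        int outCols = image[0].length - filter[0].length + 1;
        int[][] result = new int[outRows][outCols];
        for (int r = 0; r < outRows; r++) {
            for (int c = 0; c < outCols; c++) {
                int sum = 0;
                for (int fr = 0; fr < filter.length; fr++) {
                    for (int fc = 0; fc < filter[0].length; fc++) {
                        for (int k = 0; k < channels; k++) {
                            sum += image[r + fr][c + fc][k] * filter[fr][fc][k];
                        }
                    }
                }
                result[r][c] = sum;
            }
        }
        return result;
    }
}
```

Summing inside the innermost loop is equivalent to convolving each channel separately and then adding the three resulting matrices.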
There is a simple implementation of EdgeDetection if you would like to try it on your own. Executed on the image below:
it gives the results below for the vertical, horizontal and Sobel filters:
Vertical:
Horizontal:
Sobel:
We have already seen that the convolution operation:
We can of course control both of them in a way that serves our model better. We have three ways to control the convolution matrix (CM) dimensions.
We can use padding when we want our CM to be equal to or even greater than the original image in size (the third dimension is one for the black and white example). Padding just adds zero-valued rows and columns around the original image. So our simple black and white representation looks like below:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 255 | 255 | 255 | 10 | 10 | 10 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We changed the dimension from 8×6 to 10×8 by applying a padding of 1 (one layer of zeros all around). Now if we apply a 3×3 convolution filter as explained above to this new 10×8 matrix (sliding one position horizontally and then one position vertically) we get a CM with dimension 8×6, the same as the original image. Padding more would make the CM even greater than the original image. By now you have probably guessed that there is a specific formula relating the original image dimension to the CM dimension. The formula is fairly easy:
$CM = OM + 2P - F + 1$
where CM is the convolution matrix dimension (column or row), OM is the original matrix dimension (column or row), P stands for the padding amount and F is the filter dimension (column or row).
So if we would like the CM to have the same dimension as the OM, we just need to solve the above equation so that it equals the original matrix dimension:
$OM + 2P - F + 1 = OM \Rightarrow P = (F - 1) / 2 = (3 - 1) / 2 = 1$
We found that padding by one gives us the same dimension as the original matrix after applying the 3×3 convolution.
Padding by one makes the 8×6 matrix grow to 10×8. Let's try to apply the convolution now:
$CM^{row} = OM^{row} + 2P - F^{row} + 1 = 8 + (2 \cdot 1) - 3 + 1 = 8$
$CM^{col} = OM^{col} + 2P - F^{col} + 1 = 6 + (2 \cdot 1) - 3 + 1 = 6$
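Padding itself is easy to sketch in code. The helper below (hypothetical names, zero padding only) surrounds the image with P layers of zeros and also evaluates the formula CM = OM + 2P - F + 1 used in the two calculations above.

```java
final class ZeroPadding {
    /** Returns a copy of the image surrounded by p layers of zeros on every side. */
    static int[][] pad(int[][] image, int p) {
        int rows = image.length;
        int cols = image[0].length;
        int[][] padded = new int[rows + 2 * p][cols + 2 * p]; // Java arrays start out filled with 0
        for (int r = 0; r < rows; r++) {
            System.arraycopy(image[r], 0, padded[r + p], p, cols);
        }
        return padded;
    }

    /** CM = OM + 2P - F + 1, for a filter that slides one position at a time. */
    static int outputSize(int om, int f, int p) {
        return om + 2 * p - f + 1;
    }
}
```

With an 8×6 image and p = 1, pad returns a 10×8 matrix, and outputSize(8, 3, 1) and outputSize(6, 3, 1) give back 8 and 6.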
We can use striding when we would like to shrink our CM (the third dimension is always one) even more, as by default it always shrinks a bit. Until this point we always slid the filter one position at a time; what if we slide it 2, 3 or 4 positions? Sliding the filter with a greater step makes the CM smaller, as we simply check fewer positions of the original matrix. The reason we might want to do that is to save some computation and memory: it may not hurt our model, and it speeds up the learning phase. Of course there is again a formula:
$CM = (OM - F) / S + 1$
where CM is the convolution matrix dimension (column or row), OM is the original matrix dimension (column or row), S stands for the stride (number of positions per slide) and F is the filter dimension (column or row).
So basically S reduces the CM dimension by a large factor; even in the smallest case, S=2, the dimensions are roughly divided by two.
Above we saw how to control the CM dimensions on rows and columns, but we always had the third dimension set to one (convolution always produces one two-dimensional matrix). This is because, as we saw, convolution adds up all the per-channel matrices into one CM, and this happens regardless of whether the original matrix had a third dimension (RGB). What we can do here is add more filters and apply convolution to the original matrix for each of these filters. By doing so we end up with one CM per filter. So if we add 3 filters (each three-dimensional, with 3 channels like the original matrix) we end up with 3 CMs, the same depth as the original RGB matrix. Below is how this may look:
So in the three sections above we learned how to control the row and column dimensions through padding and striding, and the third dimension (channels) by adding more filters.
We can of course combine both padding and striding in one final formula, which gives us the flexibility to control the matrix dimensions:
$CM = (OM + 2P - F) / S + 1$
where CM is the convolution matrix dimension (column or row), OM is the original matrix dimension (column or row), P stands for the padding amount, S stands for the stride (number of positions per slide) and F is the filter dimension (column or row).
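As a quick way to play with these parameters, here is a minimal sketch of the combined formula, assuming its usual form CM = (OM + 2P - F) / S + 1. The helper name is hypothetical, and rounding down when the division is not exact is an assumption of this sketch (a common convention).

```java
final class ConvolutionSize {
    /** CM = (OM + 2P - F) / S + 1; integer division rounds down when S does not divide evenly. */
    static int outputSize(int om, int f, int p, int s) {
        return (om + 2 * p - f) / s + 1;
    }

    public static void main(String[] args) {
        System.out.println(outputSize(8, 3, 0, 1)); // 6: the plain 3x3 convolution on 8 rows
        System.out.println(outputSize(8, 3, 1, 1)); // 8: padding by one keeps the size
        System.out.println(outputSize(8, 2, 0, 2)); // 4: a stride of two halves the size
    }
}
```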
It was not mentioned above, but of course we can also shrink the CM by increasing the size of the filter itself (F). Until now we saw only a 3×3 filter, but it may change, and from the formula we can see that the greater the filter dimensions, the smaller the CM dimensions (keeping other parameters constant).
The third dimension (channels) is easy to reason about, as the number of filters directly determines the number of CM channels.
It is possible to produce the same size as the input after applying a convolution; we just need to solve the above equation so that it equals the input size. When the matrix keeps the same size after the convolution is applied, the operation is usually called ‘Same Convolution‘.
Pooling layers are another type of filter, usually used to reduce matrix dimensions and therefore speed up computation. Most of the time Max Pooling is used, and only rarely Average Pooling.
Max Pooling (MP) is very similar to the filters for edge detection. Instead of multiplying and then adding all pixels of the selected sub-matrix, MP just outputs the maximum value of the selected pixels. In the same way as filters, it slides one or more positions horizontally and vertically depending on the parameters. So it shares exactly the same formula as filters for controlling the output dimensions:
$(IM + 2P - F) / S + 1$
where IM is the input matrix dimension (column or row), P stands for the padding amount (usually 0), S stands for the stride (usually 2) and F is the max pooling dimension (column or row, usually 2×2).
One thing to notice about Max Pooling is that it leaves the third dimension untouched (same value as it was). We saw that convolution filters collapse the third dimension by always producing a two-dimensional matrix, whereas Max Pooling is used only to reduce height × width and leaves the third dimension unchanged. This is done by applying the same Max Pooling filter to each channel of the input matrix separately, as many times as the third dimension value (as we saw in the Multiple Filters section).
There is not much more to say about Max Pooling, as it shares its dynamics with filters. It is hard to say what the intuition behind Max Pooling is, but it usually works well to reduce dimensions and therefore speed up computation, reduce overfitting and also improve model accuracy.
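For a single channel, Max Pooling can be sketched as follows. This is a minimal illustration with hypothetical names, no padding, and one square window size for both directions.

```java
final class MaxPooling {
    /** Slides a size x size window with the given stride and keeps the maximum of each window. */
    static int[][] maxPool(int[][] input, int size, int stride) {
        int outRows = (input.length - size) / stride + 1;
        int outCols = (input[0].length - size) / stride + 1;
        int[][] output = new int[outRows][outCols];
        for (int r = 0; r < outRows; r++) {
            for (int c = 0; c < outCols; c++) {
                int max = Integer.MIN_VALUE;
                for (int i = 0; i < size; i++) {
                    for (int j = 0; j < size; j++) {
                        max = Math.max(max, input[r * stride + i][c * stride + j]);
                    }
                }
                output[r][c] = max;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        int[][] input = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
        // Prints [[6, 8], [14, 16]]: a 2x2 window with stride 2 halves each dimension
        System.out.println(java.util.Arrays.deepToString(maxPool(input, 2, 2)));
    }
}
```

To pool a matrix with several channels, the same method would simply be called once per channel.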
As the name suggests, this pooling layer, instead of taking the max or multiplying and adding, calculates the average of the selected pixels. In practice it is found very rarely, as Max Pooling more often performs better.
Now it is time to put all the pieces together and build a model which can actually learn and then help us predict and solve problems. The part that does not change here is the Neural Network (NN) we already saw. We only add what we have seen so far in front of the NN, so the NN now learns from more image-specialized features produced by the convolutional techniques.
There is a final trick we need to discover to make CNNs even more powerful. So far we have seen only a limited number of filters: the Vertical Edge Filter, the Horizontal one and Sobel (by example only).
Why can’t we just let the NN figure out what types of filters (edge detectors) are better for the model? Maybe the NN can find better filters than Vertical, Horizontal or Sobel. The final trick consists of parameterizing the filters so that, instead of hard coding their types, they can be discovered by the NN. At the same time let's also add many parameterized filters to the model. So it may look as below:
The Θ parameters introduced by filters are identical to the NN Θ: they are learned by the network with the help of back propagation, as explained in the previous post. There is a small difference from NN parameters, though: filter parameters are smaller in number, so they do not slow down backprop and at the same time offer great features for our model.
For example, if we add 16 filters of dimension 5x5x3 and another layer of 32 filters of dimension 3×3x3, we end up with 16 * 5 * 5 * 3 + 32 * 3 * 3 * 3 = 2064 parameters. If we take the same example but with neurons:
An input size of 28×28 pixels, so 784, two hidden layers (one with 16 neurons and the other with 32) and 10 outputs produce in total 784 * 16 + 16 * 32 + 32 * 10 = 13376 weights to learn. So the number of neuron parameters is considerably larger, because it depends on the fully connected inputs, hidden layers and outputs, while the number of filter parameters depends only on the size and number of the filters, not on the image size, and therefore does not slow things down as much. Still, convolutional networks take a long time to train on simple computers because of the large number of multiplication operations in comparison to Neural Networks without convolution layers.
On the other hand, Pooling Layers do not have any parameters; they just serve as a transformer, or better, as a dimension reducer of the input matrix.
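The parameter comparison can be checked with a few lines of code. This sketch (hypothetical names) counts one weight per filter entry and one weight per connection between consecutive fully connected layers, and ignores bias terms, which is an assumption of the sketch.

```java
final class ParameterCount {
    /** Weights of a convolution layer: one per entry of each filter. */
    static int filterWeights(int filters, int width, int height, int channels) {
        return filters * width * height * channels;
    }

    /** Weights of a fully connected network: one per connection between consecutive layers. */
    static int denseWeights(int... layerSizes) {
        int total = 0;
        for (int i = 0; i + 1 < layerSizes.length; i++) {
            total += layerSizes[i] * layerSizes[i + 1];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(filterWeights(16, 5, 5, 3) + filterWeights(32, 3, 3, 3)); // 2064
        System.out.println(denseWeights(784, 16, 32, 10));                           // 13376
    }
}
```

Notice that the filter count does not change if the image gets bigger, while the first term of the fully connected count grows with every extra input pixel.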
Data used for building the application were taken from this web site :
The MNIST database has 60.000 training examples and 10.000 test examples. The data contain black and white handwritten digit images of 28×28 pixels. Each pixel contains a number from 0-255 showing the gray scale, with 0 white and 255 black. The data used here are the same as in the previous post, so for more details on how the data are organised please have a quick look here.
Same as in the previous post we use two hidden layers with 128 and 64 neurons and 10 outputs (0-9 digits). But of course there is a crucial difference: in front of the two hidden layers (before the 128 neurons) we apply several convolution and max pooling layers, like below:
Applying the formula we explored at “Putting All Together Section” we get the following dimensions:
First we apply 20 convolution filters of size 5×5 (S=1, P=0), which gives us (28-5)/1 + 1 = 24, so 24×24, and since we have 20 filters we get 24x24x20.
Second we apply max pooling with 2×2 and S=2, which gives us (24-2)/2 + 1 = 12, so 12x12x20.
Again we apply 50 convolution filters (5×5, S=1, P=0), giving 8x8x50, and max pooling (2×2, S=2, P=0), and in the end we have a 4x4x50 (800) image size. We feed the new image to a neural network with 4x4x50 (800) as input size, two hidden layers of 128 and 64 neurons and an output layer of 10 (0-9 digits). Let's see below what the code looks like.
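The dimension walk-through above can be reproduced with a tiny helper. The names are hypothetical; the layer settings (5×5 convolutions with stride 1, 2×2 pooling with stride 2, no padding) are the ones from the topology described above.

```java
final class LayerShapes {
    /** (input + 2 * padding - filter) / stride + 1 */
    static int outputSize(int input, int filter, int padding, int stride) {
        return (input + 2 * padding - filter) / stride + 1;
    }

    /** Returns the width/height after conv 5x5, pool 2x2, conv 5x5 and pool 2x2. */
    static int[] trace(int input) {
        int conv1 = outputSize(input, 5, 0, 1);
        int pool1 = outputSize(conv1, 2, 0, 2);
        int conv2 = outputSize(pool1, 5, 0, 1);
        int pool2 = outputSize(conv2, 2, 0, 2);
        return new int[] {conv1, pool1, conv2, pool2};
    }

    public static void main(String[] args) {
        int[] sizes = trace(28);
        System.out.println(java.util.Arrays.toString(sizes)); // [24, 12, 8, 4]
        System.out.println(sizes[3] * sizes[3] * 50);         // 800 inputs for the first dense layer
    }
}
```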
In the previous post we used Spark MLlib to train a simple Neural Network and predict handwritten digits, while in this post we used the Deeplearning4j framework to train a deep convolutional network. The reason is that Spark MLlib at the moment does not offer convolution layers for a Neural Network. Please find below the network configuration matching the above topology (for more details find class code here):
int nChannels = 1;   // Number of input channels
int outputNum = 10;  // The number of possible outcomes
int batchSize = 64;  // Test batch size
int nEpochs = 20;    // Number of training epochs
int iterations = 1;  // Number of training iterations
int seed = 123;      // Random seed

MnistDataSetIterator mnistTrain = new MnistDataSetIterator(batchSize, trainDataSize, false, true, true, 12345);

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed)
        .iterations(iterations)
        .regularization(false)
        .learningRate(0.01)
        .weightInit(WeightInit.XAVIER)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(Updater.NESTEROVS)
        .list()
        .layer(0, new ConvolutionLayer.Builder(5, 5)
                .nIn(nChannels)
                .stride(1, 1)
                .nOut(20)
                .activation(Activation.IDENTITY)
                .build())
        .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2)
                .stride(2, 2)
                .build())
        .layer(2, new ConvolutionLayer.Builder(5, 5)
                .nIn(20)
                .stride(1, 1)
                .nOut(50)
                .activation(Activation.IDENTITY)
                .build())
        .layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2)
                .stride(2, 2)
                .build())
        .layer(4, new DenseLayer.Builder().activation(Activation.RELU)
                .nIn(800)
                .nOut(128).build())
        .layer(5, new DenseLayer.Builder().activation(Activation.RELU)
                .nIn(128)
                .nOut(64).build())
        .layer(6, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(outputNum)
                .activation(Activation.SOFTMAX)
                .build())
        .setInputType(InputType.convolutionalFlat(28, 28, 1))
        .backprop(true).pretrain(false).build();
Training the network this time took approx. 1.5 hours and required almost 10GB of RAM. As the above code shows, the network was trained for 20 epochs with a batch size of 64. The accuracy was a quite respectable 99.2% in comparison to 97% previously without using convolution.
The model was still improving, and maybe running 20 more epochs would improve the accuracy even more, but running convolutional neural networks takes a lot of resources and time. For this type of application a GPU will greatly help in training bigger neural networks much faster. Feel free to try a deeper network or this topology with more epochs. There are also well-known network architectures for the MNIST data set, like LeNet-5. In the source you can already find an implementation of LeNet-5 which, when I ran it, offered 99% accuracy in 15 iterations with quite fast training (maybe better for prototyping).
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. Feel free to try it with different options like:
The application already loads a default model trained beforehand with 60.000 images, reaching an accuracy of 99.2% on the 10.000 test images.
!!Please try to draw in the center as much as possible, as the application does not apply the centering or cropping that was applied to the data used for training.
We can run the application from source by simply executing the RUN class or, if you do not feel like opening it in an IDE, just run mvn clean install exec:java.
The application was built using Swing for the GUI and DeepLearning4J for the network. Executing run.bat shows the GUI below:
Well, you are not alone. As a Java developer with more than 10 years of experience and several Java certifications, I understand the obstacles and how you feel.
From my experience I know what obstacles a Java software engineer faces with deep learning, so I can be of great help in making your journey with deep learning an exciting experience.
In this post we are going to develop a Neural Network with Java for training and detecting Handwritten Digits (0-9). A real application is built using Java and Apache Spark MLlib. Feel free to check out the source code and experiment on your own (fairly short instructions at the end).
Neural Networks, as their name suggests, are motivated by the brain's biological neurons. Although the brain is a highly complicated organ and some of its functions still remain a mystery to us, the cells it is made of, neurons, are fairly simple. I am quoting the explanation from this site:
The neuron is broken up into two major regions:
- A region for receiving and processing incoming information from other cells
- A region for conducting and transmitting information to other cells
The type of information that is received, processed and transmitted by a neuron depends on its location in the nervous system. For example, neurons located in the occipital lobe process visual information, whereas neurons in the motor pathways process and transmit information that controls the movement of muscles. However, regardless of the type of information, all neurons have the same basic anatomical structure.
Basically, neurons are made of an input (dendrites), a computational unit (cell body, nucleus) and an output (axon). So signals (pulses of electricity, spikes) come from other neurons' axons to some of the dendrites, then the neuron processes the signals and finally transmits them through its axon to the dendrites of some other neurons. Basically something like the picture below suggests (source):
So the question is: can we copy this fairly simple model to a more computer-friendly model? After all, that is what computers do best: getting inputs, processing them and producing outputs.
So what we need is a model which takes inputs, transforms them and outputs them in a form which is ready for other similar models to consume. A good candidate model may look like below:
With blue we mark the neuron and green the inputs.
So basically the model processes some input numbers like X1, X2, … Xn and then outputs the result. We need of course to clarify how we are going to process the inputs. To my knowledge, how a real neuron processes the signal is not known. Anyway, we can use a well-known function seen in the previous posts on Logistic Regression: the Sigmoid Function. Just to recall, the Sigmoid Function looks like below:
The Sigmoid function outputs approximately 1 when its input is well above zero and approximately 0 when its input is well below zero.
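In code the Sigmoid function is a one-liner. The sketch below uses the standard definition 1 / (1 + e^(-z)); the class name is hypothetical.

```java
final class Sigmoid {
    /** Standard logistic function: close to 0 for very negative z, 0.5 at z = 0, close to 1 for large z. */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

For example, sigmoid(0) is exactly 0.5, while inputs like 10 or -10 already land very close to 1 and 0 respectively.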
So let's take an example of how our model looks so far:
With blue we mark the neuron and green the inputs.
Since the summation of all our inputs is negative (-5, which the sigmoid in the figure maps to approximately zero), the output is 0. On the other hand, if the inputs were X1=10, X2=-20, X3=30, X4=-5, the output would be 1.
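The simple neuron so far (sum the inputs, then apply the sigmoid) can be sketched as below. The names are hypothetical, and there are no weights yet, exactly as in the model at this point of the post.

```java
final class SimpleNeuron {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Adds up all inputs and squashes the sum with the sigmoid function. */
    static double activate(double... inputs) {
        double sum = 0.0;
        for (double x : inputs) {
            sum += x;
        }
        return sigmoid(sum);
    }

    public static void main(String[] args) {
        System.out.println(activate(10, -20, 30, -5)); // the sum is 15, so the output is very close to 1
        System.out.println(activate(-5));              // a sum of -5 gives an output close to 0
    }
}
```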
So far so good, but as we saw, the neuron has multiple outputs, not only one as our simple model suggests. And the outputs are not just clones of one another; each is different. To adapt our model we will introduce the concept of weights Θ. So before the output is transmitted we will multiply it by some weights (Θ). Let's see below how our model looks now:
With blue we mark the neuron and green the inputs.
Now we are able to produce multiple different outputs just by multiplying by weights (Θ). As the figure suggests, we multiply the output 1 (the sigmoid of 25 is 1) with the weights (Θ) and get 3, -5 and 10 as outputs.
It is worth noticing that even though the inputs are the same, the output differs from the previous model: from -5 -> 0 to 25 -> 1. This happens because of the impact the weights (Θ) introduce into the model. Since we gave a lot of importance to X2 by multiplying it with the weight Θ2=2, the model now produces a positive result. In a few words, the weights (Θ), besides giving us a model supporting multiple outputs, also give us a way to greatly impact the model itself (its output).
Until now the model processes multiple inputs with the sigmoid function and produces multiple outputs by multiplying with weights (Θ). Although similar in a sense to the biological neuron, this is an isolated model. After all, we have billions of neurons connected and communicating all the time. So now it is time to connect our model in a network.
So far we connected multiple inputs with only one neuron, which in turn produces multiple (unconnected) outputs. We can enrich our model by connecting the inputs not only with one neuron but with many of them. This model will look like below:
With blue we mark the neurons and green the inputs.
The model now looks more like a network, as it gives the flexibility of connecting different inputs with different neurons. The model is not yet complete, though, as in reality neurons in turn can connect with other neurons, and these neurons with others, and so on… It is time to make a final modification to the model by introducing another layer of neurons which is connected with the previous neurons.
With blue we mark the neurons and green the inputs.
Now we have a model which can easily grow into a big network and can even have different shapes. It is worth mentioning that the neurons of Layer 1 are just inputs for the neurons of Layer 2. The Sigmoid Function is applied once in Layer 1, the result is multiplied by weights (Θ), and then it is applied again in Layer 2; if another layer is added it would be applied again, depending on how deep we want to go.
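The two-layer network can be sketched as a pair of nested function calls. This is only a minimal illustration under assumptions of this sketch: theta[j][i] holds the weight from input i to neuron j, each neuron applies the sigmoid to its weighted sum, and all names are hypothetical.

```java
final class TinyNetwork {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One layer: neuron j outputs sigmoid(sum over i of theta[j][i] * inputs[i]). */
    static double[] layer(double[][] theta, double[] inputs) {
        double[] outputs = new double[theta.length];
        for (int j = 0; j < theta.length; j++) {
            double z = 0.0;
            for (int i = 0; i < inputs.length; i++) {
                z += theta[j][i] * inputs[i];
            }
            outputs[j] = sigmoid(z);
        }
        return outputs;
    }

    /** The outputs of Layer 1 are simply the inputs of Layer 2. */
    static double[] forward(double[] inputs, double[][] theta1, double[][] theta2) {
        return layer(theta2, layer(theta1, inputs));
    }
}
```

Adding a third layer would just mean wrapping one more layer call around the result, which is what makes the model easy to grow.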
So fare we have build a model which is quite similar to real Neuron Networks as it can process multiple inputs is able to transmit multiple outputs to other neurons connected to the network. What we are missing is how to train the model so it can help us predict or solve problems.
As we saw previously in Logistic Regression and SVM we will need a model which generates a hypothesis first. To be able to train you will need first to generate some answer called hypothesis and than evaluate how well this is doing in comparison to what we want or real value. After evaluating or getting the feedback we need to adjust the model so it will produce a better hypothesis or one which generates answers closer to real values. Of course first the hypothesis can be very fare from what we want but anyway all starts with an hypothesis.
For Neural Networks(NN) the hypothesis is identical to Logistic Regression so it is represented by Sigmoid Function:
Z is the function explained in topic Insight.
Z is also identical to what is explained at Logistic Regression Insight, with only one difference: we have multiple Z in a NN in comparison to one in LR (Logistic Regression). To understand that, let's see the formal representation of Z for LR:
n-> number of examples
k-> number of features
θ^{j} -> weight for feature j
X^{j}_{i} -> the i-th example X with feature j
In the cancer prediction example we can write like below(age, diet, genome are features and the numbers are weights Θ):
One can easily notice that the weights Θ define how much a feature contributes to the final prediction or hypothesis, so the better the weights Θ, the better our hypothesis will do in comparison to the real values. Now let's see why we have multiple Z for a NN by taking only one neuron first:
As we can see, the above calculation for multiple inputs and one neuron is the same as logistic regression: Z = Θ^{1} * X^{1} + Θ^{2} * X^{2} + … + Θ^{n} * X^{n}, and the hypothesis is sig(Z).
One can easily spot that adding another neuron leads to all inputs connecting to that neuron and, as a consequence, to another Z2 being born, like below:
Not only do we have a new Z for each newly connected neuron, but the weights Θ also change from a vector (Θ^{i}) to a matrix Θ^{ij}, where i denotes the input and j the neuron that we are connecting to. There is one last piece missing: adding another layer of neurons and connecting the outputs of the neurons in Layer 1 to the neurons of Layer 2.
Besides the picture becoming a bit messier, we can notice that now we also have multiple Z^{i} per layer, so we write Z^{i}_{j}, where i denotes the neuron and j denotes the layer this neuron belongs to. Notice that we also mark Θ_{k}^{ij} with an extra k to represent the layer which the weight is contributing to. In contrast to LR we just have multiple Z's and h(x)'s, but Z and h(x) (sig(Z), the hypothesis) themselves stay the same:
So a NN introduces a hypothesis (sig(Z)) per neuron for each layer, in comparison to LR, which had only one hypothesis and tried to fit all the data there. It is as if NNs try to figure out the solution step by step instead of all at once. Each hypothesis's output is multiplied by Θ and entered as input to another neuron, which in turn produces another hypothesis, and so on, until we have the final output, which we can interpret as the answer.
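To make the layer-by-layer idea concrete, here is a minimal sketch of such a forward pass in plain Java. It is only an illustration under some assumptions of mine: the class and method names are hypothetical, bias terms are left out, and the weights are stored as theta[neuron][input], which is just one possible index order.

```java
/** Minimal feed-forward pass: every neuron computes sig(Z), with Z = sum of weight * input. */
final class ForwardPass {

    /** The sigmoid function: maps any Z into the range (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /**
     * Computes one layer. theta[j][i] is the weight from input i to neuron j
     * (index order is an assumption of this sketch). Returns sig(Z_j) per neuron.
     */
    static double[] layer(double[][] theta, double[] input) {
        double[] out = new double[theta.length];
        for (int j = 0; j < theta.length; j++) {
            double z = 0.0;
            for (int i = 0; i < input.length; i++) {
                z += theta[j][i] * input[i];
            }
            out[j] = sigmoid(z);
        }
        return out;
    }

    /** Feeds the input through all layers: the outputs of one layer are the inputs of the next. */
    static double[] predict(double[][][] thetas, double[] input) {
        double[] activation = input;
        for (double[][] theta : thetas) {
            activation = layer(theta, activation);
        }
        return activation;
    }
}
```

Calling predict with one weight matrix per layer returns the final hypothesis, one value per output neuron.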
As we mentioned earlier, once we have a hypothesis we need another method or function which tells us how good our hypothesis is in comparison with the real value we have from labeled data. This function is called the cost function; it simply takes the average squared difference between the hypothesis and the real data value, identical to LR (ignoring regularization parameters):
where y_{i} is the real value or category (like spam or not spam, 1 or 0), h(x) is the hypothesis, and m is the number of examples we have for training.
Supposing we have only one output, the formula stays identical to LR. Of course h^{Θ}(x^{i}) is calculated differently. Here h^{Θ}(x^{i}) refers to the final hypothesis, but as we know this hypothesis has gone through a long chain of calculations and recalculations from layer to layer and neuron to neuron, something like h^{Θ}(h^{Θ}(h^{Θ}(h^{Θ}(x1))))… (plus other h^{Θ} and multiplications by Θ).
How about multiple outputs? Not much changes; we just need another loop over each output y^{i}_{k} (where k refers to the output k and m is number of outputs):
Ideally we want our cost function J to be zero, so that the hypothesis equals the real value, or at least we want the difference to be as small as possible. In a few words, once we find a way to minimize this cost function, we have a model ready to predict, as it has learned to generate hypotheses as close as possible to the real labeled data values.
So far we have a hypothesis and also a function that tells how well the hypothesis is doing with regard to the real data. Now we are ready to use that feedback to bring our hypothesis closer to the labeled (real) data we have. Again the procedure is the same as for LR: we simply use Gradient Descent (see the previous LR post) to minimize the cost function.
First we pick random values of θ just to have some starting values, then we calculate the cost function. Depending on the results we lower or increase our θ values so that the cost function moves towards zero. We repeat this procedure until the cost function is almost zero (0.0001) or is not improving much from iteration to iteration.
Gradient Descent uses the derivative of the cost function to decide whether to lower or increase the θ values. Besides the derivative, which just gives the direction (lower or increase θ), it also uses a coefficient α to define how much to change the θ values.
The derivation is also where LR differs from NN, since NNs use a more sophisticated way of calculating the derivative, known as the Back Propagation algorithm. Although Back Propagation is very interesting, it is also mathematically intensive, and maybe I will address it in more detail in a next post. For now we can think of it as a black box which gives us the derivative of the final cost function. After that Gradient Descent can minimize it so that the hypothesis output and the real values are as similar as possible (ideally the difference is zero).
Data used for building the application were taken from this web site :
The MNIST database has 60.000 training examples and 10.000 test examples. The data consist of black-and-white handwritten digit images of 28X28 pixels. Each pixel contains a number from 0-255 giving the gray scale, 0 being white and 255 black.
The data are not organized in any standard image format. Fortunately there was already a solution that reads the data perfectly and surprisingly easily (thanks to a StackOverflow comment). Here is how we read the data: for each entry we build a Java bean LabeledImage:
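    import java.io.Serializable;

    // Assumed imports: the Spark ML vector types (required by the ML classifier used later).
    import org.apache.spark.ml.linalg.Vector;
    import org.apache.spark.ml.linalg.Vectors;

    // Java bean holding one MNIST entry: the digit label and its pixel values.
    public class LabeledImage implements Serializable {

        private double label;     // the real digit, 0-9
        private Vector features;  // the pixel values in one dimension

        public LabeledImage(int label, double[] pixels) {
            this.label = label;
            features = Vectors.dense(pixels);
        }

        public Vector getFeatures() {
            return features;
        }

        public double getLabel() {
            return label;
        }

        public void setLabel(double label) {
            this.label = label;
        }
    }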
It has the label, which is the real digit from 0-9, and the features vector, which represents the pixels in one dimension (a Vector is used because of MLlib requirements; a List or ArrayList would be fine for more general purposes). In our case we have 28X28 pixels, each containing a number from 0-255, which means we have a single array of length 784 containing numbers from 0-255. After reading the data we have a list of LabeledImage, i.e. List<LabeledImage>.
So far we have described the model, which has the input, a number of processing layers, and the output. We did not describe the real nature of the input because it was more abstract at that time, but now it is time to explain it in a more specific way.
The input of the model we described (X1…Xn) is the features vector of the LabeledImage object. Think of one example represented by one LabeledImage object, which holds a features vector (simply a one-dimensional vector containing the 28X28 = 784 pixel values from 0-255). The input of our model is the features vector, so the input size is 784. Of course this is the case for one example; to scale to more examples we simply execute the model for each example. In a few words, the inputs X1..Xn are not the examples but the features of your data for one example (n in our case is 784). So to fully train your model you need to compute the cost function and its derivative for each example.
The output, on the other hand, is easier to reason about: we have 10 digits to discover, from 0-9, so the output is a one-dimensional vector of size 10. The values of the output vector express how likely the input is to be each of those digits. Say we have already trained our model and now ask it to predict a 28X28 image of a 3. The output may be something like [0.01, 0.1, 0.4, 0.95, 0.02, 0.05, 0.03, 0.1, 0.5, 0.02], which translates to: 1% confidence that the input is 0, 10% that it is 1, 40% that it is 2, 95% that it is 3, and so on. So the index of an item in the vector represents the digit, and the value the confidence the model has that the input is that digit.
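Turning such an output vector into a predicted digit is just a matter of picking the index with the highest value. A minimal sketch (the class and method names are hypothetical, not taken from the application):

```java
/** Picks the predicted digit from the 10-element output vector. */
final class DigitPrediction {

    /** Returns the index of the largest value, which is the digit the model is most confident about. */
    static int mostLikelyDigit(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

For the example vector above the largest value, 0.95, sits at index 3, so the predicted digit is 3.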
So far we have a model with an input of size 784 and an output of size 10. Now it is time to configure the other layers, the hidden layers. Since there is no magic way of deciding, we ended up choosing two hidden layers with 128 and 64 neurons. In theory more layers can fit the data better, but training also becomes much slower, so deciding about the layers is mostly based on the desired accuracy of the model. There is room for improvement here; in the future it may be worth trying different layer configurations to see what works best. This can also be done automatically, choosing the best model at the end. The code for training looks like below:
    public void train(Integer trainData, Integer testFieldValue) {
        initSparkSession();

        List<LabeledImage> labeledImages = IdxReader.loadData(trainData);
        List<LabeledImage> testLabeledImages = IdxReader.loadTestData(testFieldValue);

        Dataset<Row> train = sparkSession.createDataFrame(labeledImages, LabeledImage.class).checkpoint();
        Dataset<Row> test = sparkSession.createDataFrame(testLabeledImages, LabeledImage.class).checkpoint();

        //in=28x28=784, hidden layers (128,64), out=10
        int[] layers = new int[]{784, 128, 64, 10};

        MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
                .setLayers(layers)
                .setBlockSize(128)
                .setSeed(1234L)
                .setMaxIter(100);

        model = trainer.fit(train);

        evalOnTest(test);
        evalOnTest(train);
    }
When I first tried to train the network the results were a disaster: only 10% of the images were detected correctly. Then I noticed that the data were not uniform, in the sense that one can find values like 0, 0, 0, 1, 2, 200, 134, 68 … So I decided to normalize the data to have more uniform values, roughly between -1 and 1. The formula we already explained in a previous post is like below:
x_{i} := (x_{i} - μ_{i}) / s_{i}
where μ_{i} is the average of all the values for feature (i) and s_{i} is the range of values (max - min), or the standard deviation.
So for each feature, in our case each pixel value, we subtract the mean of all pixels of that image and divide by the difference between the max and min pixel values of that image. The code looks like below:
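    private double[] normalizeFeatures(double[] pixels) {
        // Start from the infinities so that any pixel value replaces them
        // (Double.MIN_VALUE is the smallest positive double, not the most negative one).
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (double pixel : pixels) {
            sum += pixel;
            if (pixel > max) {
                max = pixel;
            }
            if (pixel < min) {
                min = pixel;
            }
        }
        double mean = sum / pixels.length;
        double range = max - min;
        double[] pixelsNorm = new double[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            // A uniform image has range 0; map it to 0 instead of dividing by zero.
            pixelsNorm[i] = range == 0 ? 0 : (pixels[i] - mean) / range;
        }
        return pixelsNorm;
    }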
This implementation does not follow the formula above 100%, because it normalizes per example/image (average, max and min are calculated over the pixels of one image), while the formula requires calculating the average, max and min per feature over all examples/images, and then subtracting that average from, and dividing by that max-min, each image's pixels. Since our example is a simple black-and-white image this normalization works fine; there are even simpler implementations, such as deeplearning4j, which just divides by 255 for the handwritten digits MNIST dataset. In other applications, however, the above formula should really be applied over all examples and per feature.
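For comparison, a minimal sketch of the per-feature variant could look like the following. This is not the code used in the application; the class name and the data[example][feature] layout are assumptions made for illustration.

```java
/** Mean normalization per feature (column), computed over all examples (rows). */
final class FeatureScaler {

    /** Returns a normalized copy of data[example][feature]. */
    static double[][] normalize(double[][] data) {
        int examples = data.length;
        int features = data[0].length;
        double[][] result = new double[examples][features];
        for (int j = 0; j < features; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            double sum = 0;
            for (double[] row : data) {
                sum += row[j];
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double mean = sum / examples;
            double range = max - min;
            for (int i = 0; i < examples; i++) {
                // A constant feature carries no information; map it to 0 instead of dividing by zero.
                result[i][j] = range == 0 ? 0.0 : (data[i][j] - mean) / range;
            }
        }
        return result;
    }
}
```

The difference from the per-image version is only which values the mean and range are taken over: here each pixel position is compared with the same position in all other images.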
After applying normalization, accuracy increased dramatically to 97%, so only 3% were wrongly detected by the model. This is a good result taking into account the simplicity of the model and the effort. Almost everything else was handled automatically by the model and the Back Propagation algorithm.
Of course there is plenty of room for improvement here:
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. You can try it yourself by choosing different options like:
The application already loads a default model trained beforehand, with 97% accuracy, tested on 10.000 test images and trained with 60.000 images (two hidden layers: 128 neurons and 64 neurons).
Please try to draw in the center as much as possible, as the application does not apply the centering or cropping that was applied to the data used for training.
We can run the application from source by simply executing the RUN class or, if you do not feel like opening it in an IDE, just run mvn clean install exec:java.
The application was built using Swing for the GUI and Spark MLlib for the neural network. Executing run.bat shows the GUI below:
Please feel free to explore the source code and the application (instructions and a preview can be found at the end).
As online retail businesses grow and carry large inventories, the challenge unfolds of suggesting to the user only what will be most interesting and useful from a selling perspective. According to this article, Amazon experienced a 29% increase from helpful suggestions, so their impact on the business may be considerable. At the same time, another article suggests that although recommender systems can be very useful, we need significant data and knowledge about our users in order to have good results.
Most of the time we are faced with very little Explicit Feedback (EF) like ratings or likes, but fortunately we do have user activities like clicks, views, time spent and buying history. Implicit Feedback (IF), in contrast to EF, doesn't offer direct clarity about preferences. A rating scale from 1-5 gives very clear feedback on how much a user prefers a certain genre of movie or book. On the other hand, the fact that a user bought or viewed a book does not necessarily mean that the user liked it (maybe it was a gift). The same is true in the other direction: a user not having clicked on or seen a movie doesn't mean he doesn't like it.
Although implicit feedback introduces some uncertainty, most of the time it provides useful insight. For example, a user that bought books from an author several times probably prefers that author's books. Or if a user spent a long time reading an item's review or description, or even clicked and returned several times, then this item category is probably of interest and attracts him for the moment.
Regardless of the situation, it is obvious that an implicit feedback system offers a level of confidence about user preferences, in contrast with EF, which offers the preferences themselves.
EF systems offer both positive and negative feedback on a specific scale, like ratings 1-5. As a consequence, when implemented, EF systems take into consideration only the data that users rated and ignore unknown data. This helps the algorithm scale well and at the same time perform well. With implicit feedback, by contrast, we have no negative feedback but only positive feedback like user activity, clicks and purchase history. Therefore we cannot simply ignore zero-activity items, as they may be of great interest to the user. This leads to the algorithm processing large amounts of data and not scaling well when the input increases. Fortunately this paper suggests an optimization which speeds up the processing time.
So in a few words we need a modification to our previous method using EF in order to address:
This section is a short, humble explanation of the paper Collaborative Filtering for Implicit Feedback Datasets by Hu, Koren, and Volinsky, which gives the model and solution for this problem. It is worth mentioning that Apache Spark MLlib also uses this paper as the reference for its implementation of the Collaborative Filtering algorithm.
Recalling from the previous post, the cost function looked as below (here we are following the paper's semantics):
In this section we are going to modify the above equation to adapt it to the IF problem. First we start by defining the preference p^{ui} of user u for item i:
In a few words, the modification is about giving more weight to preferences in which we have high confidence, by increasing the cost of a prediction mistake on them. So the minimization will have to fit especially those preferences if it is ever to converge, which it will, since with one set of factors held fixed the cost is a convex function of the other.
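As a rough illustration of the idea, the sketch below follows the definitions from the Hu, Koren and Volinsky paper: the preference is 1 when there was any interaction and 0 otherwise, and the confidence grows linearly with the amount of interaction. The class and method names are hypothetical, the regularization term is left out for brevity, and the value of α is up to you (the paper reports good results with 40).

```java
/** Preference and confidence for implicit feedback, as defined in the paper. */
final class ImplicitFeedback {

    /** p_ui: 1 if the user interacted with the item at all (r_ui > 0), otherwise 0. */
    static double preference(double r) {
        return r > 0 ? 1.0 : 0.0;
    }

    /** c_ui = 1 + alpha * r_ui: the more interaction, the more confident we are. */
    static double confidence(double r, double alpha) {
        return 1.0 + alpha * r;
    }

    /** Confidence-weighted squared error for one (user, item) pair; regularization omitted. */
    static double weightedError(double r, double alpha, double[] userFactors, double[] itemFactors) {
        double prediction = 0.0;
        for (int k = 0; k < userFactors.length; k++) {
            prediction += userFactors[k] * itemFactors[k];
        }
        double diff = preference(r) - prediction;
        return confidence(r, alpha) * diff * diff;
    }
}
```

Note that an item with zero interaction still contributes to the cost (with confidence 1), which is exactly why zero-activity items cannot be ignored.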
The online retail data used for building the algorithm and application can be found here. There are 541.910 rows containing customer- and product-related data.
What will be used as the interaction r^{ui} of user u with item i is the user's purchase history. So if a user has a history of buying a lot of items of some type, we will tend to think he likes similar items and may want to buy again. On the other hand, if a user buys something it doesn't necessarily mean he likes it; after all, it may be a gift, or the user was not satisfied with the product. This uncertainty that buying history carries makes it a good fit for the implicit feedback model as our interaction source. The data will look like below:
The data need to be pre-processed and prepared because there are some mistakes here and there, and at the same time Spark MLlib accepts only integers as productId and userId.
Similar to what is described in the previous post, we divide the data into two groups, training data and test data (previously it was divided into three: training data, cross validation and test data).
The training data are randomly chosen as 80% of the whole data set and the test data are the remaining 20%. As always, training data are used for training the algorithm, while test data are used to see how the algorithm performs on unseen data.
As the evaluation method we compare the predictions and the real values using the RMSE method described here.
What RMSE does is basically take the squared difference between the prediction and the real value for all data, average it, and take the square root. The square is used in order to give more weight to (exaggerate) large differences between what the algorithm predicted and the wanted value.
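A minimal sketch of the RMSE calculation in plain Java (the class name is hypothetical; in the application this would normally be left to an evaluator class of the library):

```java
/** Root Mean Squared Error between predictions and real values. */
final class Rmse {

    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumOfSquares += diff * diff;   // squaring exaggerates large differences
        }
        return Math.sqrt(sumOfSquares / predicted.length);
    }
}
```

A perfect prediction gives an RMSE of 0, and a single large miss weighs more than several small ones.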
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. You can try it yourself by choosing different options like:
The application already loads a default model trained beforehand, with RMSE 42, 350 features, 80% training and 20% test data and regularization parameter 0.01 (that is why it may take some seconds for the app to load).
We can run the application from source by simply executing the RUN class or, if you do not feel like opening it in an IDE, just run mvn clean install exec:java.
The application was built using Swing for the GUI and Spark MLlib for the Collaborative Filtering algorithm. Executing run.bat shows the GUI below:
From a high-level perspective, what we want to build is an autocomplete field that, when we type some characters, suggests book titles that start with those characters.
There are various options:
Again there are various options:
    for (String title : allTitles) {
        if (title.startsWith(charsEnteredByUser)) {
            options.add(title);
        }
    }
If N is the size of the list and k the length of the prefix, we need θ(N*k) time to search. Inserting a new title takes constant time (θ(1)), although adding new titles happens fairly rarely.
In this section we will explore how Tries can help with searching for a prefix match in a list of titles (words). Tries are fairly easy to understand once you get how the words are inserted:
So basically we insert a word's characters into separate nodes when the characters do not exist yet, and reuse existing nodes otherwise. We also mark the end of each word with a special sign so that later on we know when a full word is reached.
Let's see how we can search for titles starting with “te”:
When searching we start from the root and look among its immediate children for a match of our first character (t). When a node matching the character is found, we treat it as the new root and continue to look among its direct children for a match of the next character (e). This logic continues until there are no more characters left in the prefix. At that point the suggestion list is the whole sub-tree below our last matched node. So we simply traverse the sub-tree and add a word whenever the end-of-word sign is reached.
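The insert and search steps described above can be sketched as follows. This is a minimal illustration, not the Trie class from the project: the class name PrefixTrie is hypothetical, and a boolean flag plays the role of the end-of-word sign.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Basic trie: prefix search walks down k nodes, then traverses the sub-tree below. */
final class PrefixTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;   // the "special sign" marking a complete word
    }

    private final Node root = new Node();

    /** Reuses existing nodes and creates new ones only for characters not yet on the path. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    /** Returns all inserted words starting with the prefix (in no particular order). */
    List<String> startsWith(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>();   // no word has this prefix
            }
        }
        List<String> result = new ArrayList<>();
        collect(node, new StringBuilder(prefix), result);
        return result;
    }

    /** Depth-first traversal of the sub-tree, adding a word at every end-of-word mark. */
    private static void collect(Node node, StringBuilder path, List<String> result) {
        if (node.endOfWord) {
            result.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            path.append(entry.getKey());
            collect(entry.getValue(), path, result);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```

After inserting "tea", "ten", "to" and "inn", a search for "te" walks two nodes and then collects "tea" and "ten" from the sub-tree.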
You may think: not so fast, the complexity there is not θ(k), where k is the length of the prefix! Indeed the complexity is rather θ(k+M), where k is the length of the prefix and M is the size of the suggestion list, or rather of the sub-tree under the last matched node (immediate children are kept in a hash table, so constant time is needed to look up a character match). We still need to traverse the sub-tree to collect the suggested words/titles, and therefore if this list turns out big it can considerably slow down the algorithm. Of course this is better than θ(k*N), where k is the length of the prefix and N the size of the whole list, but can we still do better?
Well, we can slightly augment the nodes to store more information than just the character, as below:
As we notice, we now store in each node, besides the character, also the word we are inserting (in practice a reference to the word). Step by step each node will hold the list of words that passed through it on their path.
This modification greatly helps to avoid going down the whole sub-tree under the last matched node, since the node now already has the list of the words which the sub-tree contains. Let's see below how the search would look now:
The difference with this solution is that when we reach the last node, that node already has the list of words starting with the prefix ready. So there is no need for sub-tree traversal, and therefore the complexity is now θ(k), where k is the length of the prefix.
There is a final small trick and the algorithm is ready to be implemented. Titles are usually sentences rather than a single word. It would not be very useful to search only the beginning of the title because, for example, a lot of titles start with “The …” (The Walking Dead), and therefore we would miss those suggestions if the user searches with something more meaningful than “The”.
The solution is easy: we insert each of the words separately into the tree but save the whole title sentence in the node's suggestion list. In this way we can search with middle words (walking) and at the same time be able to suggest the whole title.
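A minimal sketch of this augmented trie is shown below. It is an illustration only and may differ from the project's Trie class: the name SuggestionTrie, the lower-casing of words and the splitting on whitespace are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Trie whose nodes remember every title that passed through them. */
final class SuggestionTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Set<String> titles = new LinkedHashSet<>();   // suggestions ready at this node
    }

    private final Node root = new Node();

    /** Inserts every word of the title; each node on a word's path stores the whole title. */
    void insertTitle(String title) {
        for (String word : title.toLowerCase(Locale.ROOT).split("\\s+")) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
                node.titles.add(title);
            }
        }
    }

    /** Walks k nodes for a prefix of length k; no sub-tree traversal is needed. */
    List<String> suggest(String prefix) {
        Node node = root;
        for (char c : prefix.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        return new ArrayList<>(node.titles);   // defensive copy of the stored suggestions
    }
}
```

The price for the faster lookup is memory: every title is referenced from each node along the path of each of its words. The sketch returns a copy of the list for safety; handing out the stored list directly (or only its first few entries) avoids even that extra work.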
The code is fairly easy (50 lines), so please feel free to have a look at Trie and TrieTest.
We have only a limited number of suggestions to show, so when it comes to which suggestions to show to the user, I think the best answer is: whatever is most relevant or closest to the user's interests. This leads us to 4 (it was underlined anyway :)) and therefore to Recommender Systems.
A recommender system suggests information to users based on the trend of their preferences in the data. The main advantage of such a system is that it learns automatically as it gets to know more about the users' preferences. So basically the more the users interact with the system (users like/click particular books or movies), the better the suggestions (closer to the user's interests) the system is going to make. In a previous post we explained in detail how this is achieved using the Collaborative Filtering algorithm, and an application was built for suggesting movies based on user ratings. In this post we are going to implement the same algorithm for suggesting books instead of movies.
Thanks to this source for providing enough data to build a meaningful algorithm:
“Improving Recommendation Lists Through Topic Diversification”, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. To appear.
The data set is quite big: approximately 271.000 books, 1.1 million ratings and 279.000 users.
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. We can run the application from source by simply executing the RUN class or, if you do not feel like opening it in an IDE, just run mvn clean install exec:java.
You can try it by first rating some books (please notice that if no books are rated first, no suggestions are made) and then searching in the field for autocomplete suggestions. Feel free to play around (50 features do not take long to train) and notice how the algorithm adapts in accordance with your preference changes. For 50 genres it may take up to 30 seconds to give you suggestions with the predicted rating and the error of the algorithm (Mean Squared Error). You can also increase the ratings data size up to 1.149.000, but please be aware that this slows down the training process.
The application was built using Swing for the GUI and Spark MLlib for the Collaborative Filtering algorithm. After running it, the screen below will show up:
Recommender systems are probably one of the most widely used applications of Machine Learning. A recommender system suggests information to users based on the trend of their preferences in the data. The main advantage of such a system is that it learns automatically as it gets to know more about the users' preferences. So basically the more the users interact with the system (users like/click particular books or movies), the better the suggestions (closer to the user's interests) the system is going to make.
So we have a very big list of movies of different genres, and our user has rated a few of them. Now we want to predict what ratings this user would have given to movies he hasn't seen or rated, and suggest to the user only the top ten with the highest predicted ratings.
Let's suppose that we know how much of a certain genre a movie is (70% action and 30% romance) and the ratings (1-5) given by our users. Given the users' ratings and how much of each genre a movie is, we want to know how much our users like the movie genres, or what the trend of the users' preferences towards genres is.
For example, since Bob has given 4 to Transformers and Transformers is an Action movie, we can say he likes Action movies. Similarly Alice has given 5 to Beauty and the Beast (Romance), so she tends to like Romance movies.
The reason we want to know the user preferences is that it will help us predict whether the user will like a specific genre of unrated movie or not.
Rates are from 0-5
Let's suppose now that we have the ratings from users per movie and how much the users like each movie genre. Now we want to find what genre a movie is.
For example, we know that Bob likes Action movies and has given 4 to Transformers, so most probably Transformers is an Action movie. Similarly, we know Alice likes Romance movies and she rated Beauty and the Beast 5, so most probably Beauty and the Beast is a Romance movie.
Rates are from 0-5
Most of the time it is hard to figure out features like Action, Romance, Family, because there can be more, like Sci-Fi, Animated, Adult… and who knows what other features would help us get better suggestions. In order to figure out the features we can use the users' ratings and their preferences towards features.
For example, let's say we have 4000 users and 10.000 movies. A feature can be born when 300 users with similar preferences like (4 to 5 stars) a group of 1000 movies, and this feature can be something we could never come up with logically (maybe because one or two specific actors play in them).
The insight here is similar to what we saw in Logistic Regression. Both problems we have to solve have multidimensional data (features) and a known outcome used for training (ratings).
Let's suppose we magically know how much of each genre (horror, family, action) a specific movie is. What we want to predict is ratings.
More specifically, we want to find out how much a certain genre contributes to a user's ratings. In a few words, we want to know the weight (θ) of each genre in a particular user's preferences. The weights can differ greatly from user to user; for example, the Action genre can have a greater impact on Bob than on Alice, and the Romance genre a greater impact on Alice than on Bob. So for Bob we can have:
and for Alice :
More formally:
n-> number of examples
k-> number of Genres(features)
θ^{j }-> weight for genre j(want to have)
X^{j}_{i }-> amount of genre j of the i-th movie(known)
R_{u}-> Rating for User u
So once we have the weights (θ) it is straightforward to find the user's ratings for all movies not yet rated by him; after all, the weights now give us the user's preferences. We just need to apply the weights (θ) found by the algorithm and the genre amounts (X) to the simple equation above. So, for Bob, an Action movie will get a high predicted rating and a Romance movie a low one.
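Based on the definitions above (a weight θ^{j} per genre and an amount X^{j} of each genre per movie), the prediction is just a weighted sum. A minimal sketch, where the class name and the numbers used for Bob are made up for illustration:

```java
/** Predicts a rating as the sum of weight * genre amount over all genres. */
final class RatingPredictor {

    /** theta[j] is the user's weight for genre j, genres[j] the amount of genre j in the movie. */
    static double predict(double[] theta, double[] genres) {
        double rating = 0.0;
        for (int j = 0; j < theta.length; j++) {
            rating += theta[j] * genres[j];
        }
        return rating;
    }
}
```

For example, with hypothetical weights {4.5, 0.5} for (action, romance), a movie that is 70% action and 30% romance gets 4.5 * 0.7 + 0.5 * 0.3 = 3.3, while a pure romance movie gets only 0.5.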
Here the problem is reversed. Let's suppose that magically we have the weights (θ, the user preferences) but we are missing the genres, i.e. the description/classification of the movies. What we want to predict is still ratings. Now the role of the weights (θ) is simply taken over by the genres. The equation is exactly the same as above with one difference: the weights (θ) are not variables but known values:
More formally:
n-> number of examples
k-> number of Genres(features)
θ^{j }-> weight for genre j(known)
X^{j}_{i }-> amount of genre j of the i-th movie(want to have)
R_{u}-> Rating for User u
So once we have how much of each genre (X) a movie is, it is straightforward to find the user's ratings for all movies not yet rated by him. Now, besides the user preferences, which are given, we also have detailed information (genres) about the movies, so we can easily say whether a particular movie is close to the user's preferences or not. We just need to apply the weights (θ) and the genre amounts (X) found by the algorithm to the simple equation above.
Once we understand the two problems in isolation, it is time to figure out how the magic is done. In reality we have neither the weights (θ) nor the genre amounts (X), so basically we have an equation with two unknowns. For simplicity let's take simple equations for each of our problems:
1 * θ + 1* X = 5 (Users Preferences)
2 * θ + 3 * X = 5 (Genres)
1.5 * θ + 0.5* X = 5 (Users Preferences)(2)
1.5 * θ + 5 * X = 5 (Genres)(2)
If you are disappointed because there is still some magic, feel free to read the next sections about these two. Fortunately we have that magic method, called the cost function, and the way to change θ and X, called gradient descent.
This is the method that magically tells us how far from the best solution we are. It is very similar to the one in the Logistic Regression post, even simpler. We want the best user preferences looking at the ratings and genres, and similarly the best genres looking at user preferences and ratings, so not just any values.
So what we need is to measure how well our current solution is doing. Let's call the current solution the hypothesis. The hypothesis is basically what we saw in the Insight section, our way to predict ratings:
Replacing the X and θ values gives us the rating for a particular user.
Once we have the hypothesis rating (Ru) calculated on the current X and θ, we can simply compare it with the real user rating we have. So if the hypothesis gives us 2 and in reality the user rated 5, this tells us that we are 5-2=3 units away from the wanted value. Similarly, if the hypothesis gives us 5 and in reality the user rated 1, this tells us that we are 1-5=-4 units away from the wanted value. What we ideally want is for our hypothesis to be exactly like the real value, so that the difference is 0. In a few words, we want to minimize the cost function.
In reality the cost function calculates the average of the squared differences between our predictions (Ru, the hypothesis ratings) and the real data (user ratings), like below:
where y_{i} is the real value (the rating) and h(x) is the hypothesis. Ideally we want the cost function J to be zero, or very small, because this tells us that there is little or no difference between the hypothesis and the real value.
We have our hypothesis, which gives the current prediction of our algorithm, and the cost function, which tells us how well it performs. What is missing is a way to react after the cost function has measured the performance of the hypothesis. Reacting basically means decreasing or increasing the values of θ and X so that the cost function is minimized.
Fortunately there is already an algorithm for minimizing the cost function: Gradient Descent. GD is an iterative algorithm which, iteration by iteration, changes the values of θ and X until the cost function gets close to 0 or converges. It uses the derivative of the cost function to decide whether to lower or increase the θ and X values. Besides the derivative, which only gives the direction (lower or increase), it also uses a coefficient α to define how much to change the θ and X values.
Changing the θ and X values too much (big α) can make Gradient Descent fail to bring the cost function close to zero, since a big increase may overshoot the real value and a big decrease may likewise end up far from the wanted value. Changing θ and X only a little (small α) is safe, but the algorithm needs a lot of time to reach the minimum of the cost function (almost zero), since we are progressing too slowly towards the wanted or real value.
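The effect of α is easy to try out on a toy problem. The sketch below is not the recommender itself: it assumes a one-dimensional stand-in cost J(x) = x^2, whose derivative is 2x, and the class name and the chosen α values are only for illustration.

```java
final class GradientDescentDemo {

    /** Runs gradient descent on J(x) = x^2 and returns x after the given number of iterations. */
    static double minimize(double start, double alpha, int iterations) {
        double x = start;
        for (int i = 0; i < iterations; i++) {
            double derivative = 2 * x;   // dJ/dx for J(x) = x^2
            x = x - alpha * derivative;  // step against the slope, scaled by alpha
        }
        return x;
    }

    public static void main(String[] args) {
        // Reasonable alpha: x shrinks towards the minimum at 0.
        System.out.println("alpha = 0.1   -> x = " + minimize(10, 0.1, 50));
        // Very small alpha: safe, but after 50 iterations x is still far from 0.
        System.out.println("alpha = 0.001 -> x = " + minimize(10, 0.001, 50));
        // Too big alpha: every step overshoots the minimum and x moves further away.
        System.out.println("alpha = 1.1   -> x = " + minimize(10, 1.1, 50));
    }
}
```

With α = 0.1 every step multiplies x by 0.8, with α = 0.001 by 0.998, and with α = 1.1 by -1.2, which is why the last run jumps from one side of the minimum to the other while moving away from it.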
More formally, the function looks exactly like the one in the Logistic Regression post. In our case we just have two equations: one for finding user preferences (θ) and one for finding movie genres (X). The partial derivative differs between them: for user preferences we multiply by X and for genres we multiply by θ. In reality both equations are merged together to optimize the execution cost, but since the merged equation becomes too messy we will not show it here; feel free to find a more formal view here.
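As a rough sketch, the two separate update rules described above could be written as follows. The notation is an assumption made for this illustration and is not taken from the application: y_{i,j} is the rate user j gave to movie i, the hypothesis is assumed to be θ_j^T x_i, constant factors coming from the derivative are absorbed into α, and any regularization term is left out.

```latex
% Assumed notation: \theta_j = preferences of user j, x_i = genres (features) of movie i,
% y_{i,j} = real rate of user j for movie i, \alpha = step size.
\begin{align*}
\theta_j &:= \theta_j - \alpha \sum_{i \,:\, \text{rated by } j} \left( \theta_j^{T} x_i - y_{i,j} \right) x_i \\
x_i &:= x_i - \alpha \sum_{j \,:\, \text{rated } i} \left( \theta_j^{T} x_i - y_{i,j} \right) \theta_j
\end{align*}
```

In both rules the difference between hypothesis and real rate is the same; only the factor it is multiplied by changes, X for the user preferences and θ for the genres.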
Plotting the cost function J against the θ and X values can help to understand what Gradient Descent is doing. For simplicity we take one-dimensional data as an example. Looking at the cost function equation:
we can simplify it to this one: Y=X^{2}
Plotting Y for X values from -10 to 10, it will look like below:
The data used by the application was taken from MovieLens, more specifically the small data set intended for education and development. The data contains 9125 movies, 671 users and 100,004 ratings from these users in total.
As we mentioned in the algorithm details, we are not using any genres from the movie data but rather let the algorithm figure out genres and ratings as two problems helping each other step by step. That said, we still use some basic genres just to make the application GUI more friendly. That is, you are asked to rate movies, and for simplicity the movies are categorized by some basic genres like below:
* Action
* Adventure
* Animation
* Children’s
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
But again, the algorithm does not use them. As a matter of fact, the algorithm uses 50 features as a minimum and lets you choose more. The more features we choose, the better the algorithm performs, but notice that at the same time it becomes slower to train.
The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. You can try it yourself by rating some movies and then waiting for suggestions. Feel free to play around (50 features do not take long to train) and notice how the algorithm adapts as your preferences change.
We can run the application from source by simply executing the RUN class or, if you do not feel like opening it with an IDE, just run mvn clean install exec:java. The application was built using Swing for the GUI and Spark MLlib for the Collaborative Filtering algorithm.
After starting it, feel free to rate some movies by choosing genres and then hit the suggest movies button. For 50 genres it may take up to 30 seconds to give you suggestions with the predicted ratings and the error of the algorithm (Mean Squared Error).
After that you should be able to see something like this:
The Genres Size field gives you the flexibility to choose how many features (genres) you want the algorithm to know and figure out. The more genres, the better the algorithm performs and predicts ratings, but notice that this slows down the training. Below you can find how the error looks after training (0.029 Mean Squared Error; the closer to 0, the better):
It will not be hard to extend the algorithm with bigger data from MovieLens: just put it under the folder src/main/data, or change the directory in the code, which is fairly easy to do in PrepareData.
Although there are great articles out there, I believe some questions still remain: What really is an integration test? Can it be a JUnit test? Are Integration Tests duplicates of JUnit tests? What should each of the test types cover and what not? What kinds of tests does a system need?
In this post we are not going to invent new definitions of test types but rather, hopefully, come to a common understanding which would also help to use them correctly.
In the previous post we explored this type of test through an example and came to the conclusion that it is essential to avoid Testing Implementation Details and to not follow the rule “Test each method and each Class“. Here is a first attempt to describe Unit Tests:
A Unit Test is a simple test that runs in Isolation from other Unit Tests (not from its internals), aiming to cover only a small part of our system.
We cannot have a database in a JUnit test since it is a shared resource and tests can interfere with each other.
What if we create a new database at the beginning of each test and then drop it, so each test has its own database? We certainly follow the “Run in Isolation Rule”! This good question reveals the fact that our definition is not complete. We mention the word “Simple” there, but simplicity is very relative and it is not enough. Additionally, a Unit Test also needs to be very fast. The reason is well known: in this phase of development we need a very fast feedback loop. A fast feedback loop helps us to develop faster and understand the code better. In a matter of seconds we know whether our code is on the right path or not, and we can change it in a very early phase of development, where the added code is not yet big or complete. So let's try again:
A Unit Test is a very fast test that runs in Isolation from other Unit Tests (not from its internals), aiming to cover only a small part of our system.
There is still an ambiguous variable in the definition, and that is “very fast”. Very fast is relative, depending on what fast means for different systems, so I would say that it means “As Fast as Possible“. In short, feel free to mock every shared resource like a database and everything that slows down the test, like Web Services, heavy network access, third party systems… But it is important, as we say in this post, to not mock Internals or Implementation Details, as this leads to awful problems and takes agility out of your hands without you even noticing it.
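To make the idea of mocking a shared resource more concrete, here is a minimal sketch. Everything in it is hypothetical and only for illustration: RatingStore stands for a database-backed component and RatingService for the code under test. The checks are written in plain Java so the example runs without JUnit or a mocking library; in a real project they would be ordinary JUnit test methods.

```java
import java.util.List;

final class RatingServiceUnitTest {

    /** Hypothetical shared resource; in production this would be backed by a database. */
    interface RatingStore {
        List<Integer> ratingsFor(String movie);
    }

    /** Code under test: it only depends on the RatingStore abstraction. */
    static final class RatingService {
        private final RatingStore store;

        RatingService(RatingStore store) {
            this.store = store;
        }

        double averageRating(String movie) {
            List<Integer> ratings = store.ratingsFor(movie);
            if (ratings.isEmpty()) {
                return 0.0;
            }
            int sum = 0;
            for (int rating : ratings) {
                sum += rating;
            }
            return (double) sum / ratings.size();
        }
    }

    /** Each test builds its own in-memory fake, so tests cannot interfere with each other. */
    static void averagesRatingsFromTheStore() {
        RatingService service = new RatingService(movie -> List.of(5, 3, 4));
        if (service.averageRating("any movie") != 4.0) {
            throw new AssertionError("expected average 4.0");
        }
    }

    static void returnsZeroWhenMovieHasNoRatings() {
        RatingService service = new RatingService(movie -> List.of());
        if (service.averageRating("any movie") != 0.0) {
            throw new AssertionError("expected 0.0 for a movie without ratings");
        }
    }

    public static void main(String[] args) {
        averagesRatingsFromTheStore();
        returnsZeroWhenMovieHasNoRatings();
        System.out.println("all unit tests passed");
    }
}
```

Only the shared resource is replaced here. The internals of RatingService are left alone and the tests check only the returned average, so the implementation can be refactored freely without touching them.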
It is worth mentioning that this type of test is the most confusing one. I have often heard people call them End to End Tests or even Functional Tests. As we will see, this kind of test is not End to End, as it tests the system under entirely different conditions.
As we saw above, JUnit Tests tend to test the system in a state far away from Production. Many things are mocked there and small parts are tested in isolation. So we do not have the confidence that these small, well tested parts of the system work correctly together. The problems below may emerge:
Integration Tests test parts of the system in interaction with each other and, as a consequence, test the system in a state more similar to Production.
Here the confusion starts…
Let's try to answer these questions.
Integration Tests are still not testing a production system state.
Depending on the application type, YES, sometimes they are.
An integration test should always be created to test what JUnit is missing under the conditions we explained in the first point. So basically it is not meant to test third parties or to over-test things that JUnit already covers, like NullPointerException checks, empty fields, formats… This heavily depends on the system, but if we think in terms of which new paths, missing from JUnit, our test is covering, then we are fine. It is worth mentioning that Integration Tests take more time to execute, so we should be more careful in adding them and really think about the value they add or the quality they increase.
As usual: as fast as possible, with a system that looks closer to Production. So in theory we should expect our Integration Tests to execute more slowly. We should watch carefully how long all the Integration Tests take to execute, and this should not be too long, as it decreases our ability to validate changes and therefore slows down our response to change. It greatly increases the time needed to go to production and provide value for the customers. Here is an example of how it can go:
For example, if all the Integration Tests take 2 hours to execute, then a developer needs 2 hours to find out whether the code is working or not, which is 1/4 of an 8 hour working day. During this time the developer can be blocked, waiting and doing almost nothing. Or even worse, the developer can switch to a different task, lose focus completely, find it hard to return later and maybe introduce another bug. It is also possible that, during the 2 hours plus the time needed to fix the bug found by the Integration Tests, some other developers pushed code, and when the new code with the bug fix is pushed it is still not working because of their changes. So here it starts again, you wait 2 hours….
By now it should be clear that Integration Tests should run as fast as possible, by following the suggestions from the first point and avoiding duplicates of what JUnit tests already cover. I would say 20-30 minutes is an acceptable time.
When it is not possible to achieve acceptable times, then I would say the system is too big and monolithic and it urgently needs to be separated into smaller systems or modules which can be easily tested and maintained. It is better to slow down and split the system than to continue that way until one day it is not even possible to go to PROD (or it takes months).
In principle this kind of test covers what is left over from JUnit and Integration Tests.
Ideally End to End Tests test a system that is exactly like the Production one.
We say “Ideally” because it is not always possible to have a system that looks exactly like Production. Then the questions arise:
Usually anything that JUnit and Integration Tests are missing should be covered here, like production-like infrastructure, third parties, all the other slow systems we mocked previously, frameworks, platforms, technology, tools… We should be careful not to duplicate tests here, as e2e tests are very slow and expensive to maintain. For example, if our application only has a DB and does not call any web service, or we did not mock other systems in the Integration Tests, then adding e2e tests probably makes no sense, as the Integration Tests already covered the functionality. It is probably still advisable to execute a few Smoke Tests to see if the whole infrastructure is working as intended: the load balancer is working, the DB is accessible, information is arriving in the Queues and so on.
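A Smoke Test of this kind can be very small. The sketch below only checks that some endpoints accept a TCP connection and reports which checks failed. It is just one possible shape under stated assumptions: the class name, the host names and the ports are made up and have to be replaced with the real infrastructure.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class SmokeTests {

    /** Returns true if a TCP connection to host:port can be opened within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Unreachable host, refused connection, timeout or invalid port.
            return false;
        }
    }

    /** Runs every check and returns the names of the ones that failed, in insertion order. */
    static List<String> failedChecks(Map<String, BooleanSupplier> checks) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> check : checks.entrySet()) {
            boolean ok;
            try {
                ok = check.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a check that blows up counts as failed
            }
            if (!ok) {
                failed.add(check.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> checks = new LinkedHashMap<>();
        // Hypothetical hosts and ports, for illustration only.
        checks.put("database", () -> canConnect("db.example.internal", 5432, 2000));
        checks.put("load balancer", () -> canConnect("lb.example.internal", 443, 2000));
        List<String> failed = failedChecks(checks);
        System.out.println(failed.isEmpty() ? "smoke tests passed" : "failed: " + failed);
    }
}
```

Such checks say nothing about functionality; they only tell us quickly that the environment is reachable before the slower tests are started.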
The same applies here: as fast as possible, with a system preferably exactly like production. This kind of test is slow and really hard to maintain. Most of the time these tests tend to fail for non-functional reasons like network problems, issues in other systems, credential problems and so on; in a few words, generally problems we do not have in production. These problems take a long time to investigate, as we go through logs and travel through different systems, and in the end we find out that connectivity is not working. A single test here can take more time to execute, but as always we should watch carefully how long they need to run in total and how often they fail for non-functional reasons. With end to end tests it is hard to give any total time, as the system is like production and it really depends on the nature of the system, but I would really try to have very few of these tests. It is better to have longer running integration tests than end to end tests, since the latter can really harm productivity.