Are you a Java developer eager to learn more about Deep Learning and its applications, but not feeling like learning another language at the moment? Are you facing a lack of support, or confusion about doing Machine Learning in Java?
Well, you are not alone. As a Java developer with more than 10 years of experience and several Java certifications, I understand the obstacles and how you feel.
From my experience I know what obstacles a Java software engineer faces with Deep Learning, so I can be of great help to you in making the journey with deep learning an exciting experience.
In this post we are going to develop a Java face recognition application using deeplearning4j. The application offers a GUI and the flexibility to register new faces, so feel free to try it with your own images. Additionally, you can check out the free open-source code, part of the PacktPub video course Java Machine Learning for Computer Vision, together with many new improvements to the Java applications from previous posts.
Face Recognition Applications
Face recognition has always been an important problem to solve, due to its sensitivity with regard to security and because it is closely related to people's identity. For many years face recognition applications were well known especially in criminology, used for finding wanted persons with cameras and sometimes even satellites. Nowadays, in the deep-learning era, face recognition is found everywhere, from simple applications to unlocking your phone, offering state-of-the-art accuracy.
Let's first visit the challenges related to face recognition and then see how they are solved using deep learning techniques.
Face Recognition Challenges
Face Verification Problem
In previous posts we have already seen Image Classification and Object Detection, where we were mostly concerned with finding out whether an image represents a certain class (is it a dog? is it a car?), and we also saw how to mark the classified object with a bounding box.
Now we are going one step further by uniquely identifying the objects.
So not just whether the image is a car or not, but additionally whether it is specifically my car, your car, or someone else's car. (For animal classification we would need to find out whether this is John's dog or Maria's dog, rather than just a dog.)
Face verification is no different; the logic is simply extended to human faces.
The question is not whether it is simply a human or not, but rather: is this the person with identification X, or the company employee with some identification number?
Face Recognition Problem
Then we have the face recognition problem, where we need to do face verification against a group of people instead of just one person: given a new person, is he or she any of the persons in a certain group?
Although face recognition and verification can be thought of as the same problem, the reason we treat them differently is that face recognition can be much harder.
For instance, let's suppose we achieved a face verification accuracy of 98% for verifying that a person is who he claims to be.
That may not sound too bad, but if we apply that model with a 2% error rate to face recognition over, let's say, 16 people, it obviously is not going to work well, since the 2% error per comparison accumulates over 16 persons (roughly a 32% error rate).
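To make that concrete: with a 2% per-comparison error rate, the chance of at least one mistake across 16 independent comparisons is 1 - 0.98^16 ≈ 27.6%, and the simple estimate of 16 × 2% = 32% is an upper bound on it.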
So for face recognition to work well and have reasonable accuracy, also considering the sensitive nature of the problem, we will need something like 99.99% accuracy.
One-Shot Learning Problem
Usually with face recognition we have only one photo of each person to recognize, so the next challenge is related to the problem known as the "one-shot learning problem".
So let's say that we want to recognize the employees as they come in.
In reality we may have only one photo of each of the employees, or at best just a few of them.
With the knowledge and applications we have seen so far, we could of course feed all these photos to a neural network to learn, and then have the network predict a class for each of the employees. As intuitive as that may sound, it will not really work well, for the reasons below.
- The convolutional architectures we have seen so far had great results, but they were trained with thousands of images per class and millions of images in total. Here we have very little data available.
- Additionally, it will not scale well. For instance, what happens if we have a new employee? We need to modify the neural network by adding a new class and then train the network all over again. In a few words, each time we have a new employee the network needs to be modified and re-trained.
Similarity Function
The high-level solution to the above problems is implemented through a similarity function. Instead of trying to learn to recognize specific persons' faces as classes, what if we learn a function d which measures how similar or different two images are?
d(face_1, face_2) -> degree of difference between the face images
If the function returns a value smaller than a constant γ, we know that the images are quite similar; otherwise we know they are different, so not the same face or person.
Suppose that on the left we have the employees' faces and on the right a person coming in. For each comparison we get a number, which will be big when the images are different and small when they are similar. In the case above we know that the person is the third employee in our group, since that comparison has the lowest number below the threshold, e.g. γ = 0.8.
Additionally, this solution scales well, since a new person joining means just one new comparison to execute. We do not need to retrain, since the neural network has learned a generic function to distinguish faces rather than specific faces.
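As a rough sketch of this recognition-by-comparison logic (plain Java; the map of precomputed numeric face representations is a hypothetical placeholder, not code from the application; how such representations are obtained is exactly what the next sections explain):

import java.util.Map;

// Hypothetical helper: returns the registered member whose face representation is
// closest to the new face, or null if no distance falls below the threshold γ.
static String whoIs(Map<String, double[]> memberFaces, double[] newFace, double gamma) {
    String bestMatch = null;
    double bestDistance = Double.MAX_VALUE;
    for (Map.Entry<String, double[]> member : memberFaces.entrySet()) {
        double d = distance(member.getValue(), newFace); // d(face_1, face_2)
        if (d < bestDistance) {
            bestDistance = d;
            bestMatch = member.getKey();
        }
    }
    return bestDistance < gamma ? bestMatch : null; // below γ means same person
}

// Euclidean distance between two face representation vectors
static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
}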
The similarity function is just a high-level description of the solution, so let's see below two ways it is implemented in practice.
Siamese networks
We will still continue to use convolutional architectures with many convolutional layers and fully connected layers, with the exception that the last prediction layer (the softmax layer) will not be used; it is cut off.
We will feed the first image X1 to the network, then grab the last fully connected layer activations F(X1) and save them in memory.
We will repeat the same for the second image X2 that we want to compare, e.g. the newly arrived employee. Now we have the encoded activations for the second image, F(X2), saved in memory.
Notice that the network here stays the same for both images.
That's where the Siamese name comes from: we use the same network (or two cloned networks) for both images, and in practice the two executions happen in parallel.
Now the neural network, through each iteration (forward step and back-propagation), will learn the function d as shown in the picture.
In a few words, the goal of our learning is to shift the difference toward a small or a large number, depending on whether the images are the same or different.
Only when the encoded values are similar will we predict that the two images are the same. Recalling the previous section, this is exactly what we referred to as the similarity function: d denotes the distance between the activations of the last layers of a very deep convolutional network.
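A minimal sketch of this comparison with deeplearning4j (assuming a model whose embedding output vertex is named "encodings", as in the network we build later, and input images that are already preprocessed to the expected shape):

import org.deeplearning4j.nn.graph.ComputationGraph;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

static boolean sameFace(ComputationGraph model, INDArray img1, INDArray img2, double gamma) {
    // Run both images through the *same* network (the Siamese idea)
    INDArray f1 = model.feedForward(img1, false).get("encodings");
    INDArray f2 = model.feedForward(img2, false).get("encodings");
    // d(X1, X2): Euclidean distance between the two encodings
    double d = Transforms.euclideanDistance(f1, f2);
    return d < gamma; // similar enough to call it the same person
}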
Triplet loss
Triplet loss is another great way to solve the face recognition problem, and the one we will use for our Java application. The name triplet comes from the fact that we use three images as just one training sample. As before, we will use the activations of the last fully connected layer of some very deep neural network.
First we choose the base image, or anchor image, which will be compared with two other images; through a forward step we get the activations of the last layer, F(A).
Then comes a different image representing the same person, the so-called positive image, for which we similarly get the last-layer activations F(P). Recalling the previous section, we want our similarity function d(A, P), the difference between the anchor and positive image activations, to be as close to zero as possible, since these images represent the same person after all.
Now, keeping the same anchor image, we choose an image that represents a different person, the negative image, with activations F(N). The function d(A, N), the difference between the anchor and negative image activations, will in this case be bigger than zero, and we want the difference to be big in order to emphasize the fact that these are different faces.
Triplet loss is explained in more detail, with diagrams and a slightly more formal definition, in Java Machine Learning for Computer Vision. In any case, after some simple math steps, the combined formula for the positive and negative comparisons with the anchor image looks like below:
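In its standard form (writing ||·||² for the squared distance between the two encodings), the condition the network must satisfy for every triplet is:

||F(A) - F(P)||² + ε ≤ ||F(A) - F(N)||², i.e. d(A, P) - d(A, N) + ε ≤ 0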
ε is introduced to prevent the neural network from finding weights such that the distance is the same for the negative and the positive case (then the difference would be zero and the condition trivially satisfied). This way the neural network has to work harder to make sure that there is at least a minimum gap ε between the positive and the negative case.
Iteration by iteration, the neural network will try to learn the function so that the equation is satisfied: it will push the positive-case difference d(A, P) (the green part of the equation) toward lower values and the negative-case difference d(A, N) (the red part) toward larger values, keeping their difference at most -ε (moving ε to the other side of the equation).
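To make the formula concrete, here is a minimal sketch of the per-triplet loss computed from three encoding vectors (plain Java mirroring the equation, not the actual training code):

// Squared L2 distance between two encodings, i.e. ||F(a) - F(b)||^2
static double squaredDistance(double[] fa, double[] fb) {
    double sum = 0;
    for (int i = 0; i < fa.length; i++) {
        double diff = fa[i] - fb[i];
        sum += diff * diff;
    }
    return sum;
}

// Triplet loss: max(d(A,P) - d(A,N) + epsilon, 0). It becomes zero only once
// the negative is at least epsilon farther from the anchor than the positive.
static double tripletLoss(double[] fA, double[] fP, double[] fN, double epsilon) {
    return Math.max(squaredDistance(fA, fP) - squaredDistance(fA, fN) + epsilon, 0);
}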
Suggestions for Choosing Triplets
Choosing the triplets has a really big impact on how well and how efficiently the network learns, so we need to choose them carefully, following the guidelines below:
- When negative examples N are chosen randomly, the condition is easily satisfied and the network learns very little from such triplets.
- Choose triplets that are hard to train on: positive cases (images of the same face) that are as different as possible, so that d(A, P) is a big value, and negative cases (images of different people) that are as similar as possible to the anchor, so that d(A, N) is as low as possible (see the sketch after this list).
- During training you still need a few positive pictures per person to form triplets.
- After training you can handle the one-shot learning problem of having only one picture per person.
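As a sketch of the hard-negative idea from the second guideline (hypothetical helper, reusing squaredDistance from the triplet loss sketch above; real pipelines usually mine such triplets per mini-batch):

import java.util.List;

// Hypothetical helper: among candidate negatives (encodings of other people),
// pick the one closest to the anchor (the hardest negative), since a small
// d(A, N) makes the triplet condition hardest to satisfy.
static double[] hardestNegative(double[] anchor, List<double[]> negatives) {
    double[] hardest = null;
    double smallest = Double.MAX_VALUE;
    for (double[] candidate : negatives) {
        double d = squaredDistance(anchor, candidate);
        if (d < smallest) {
            smallest = d;
            hardest = candidate;
        }
    }
    return hardest;
}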
Java Application
The code can be freely found on GitHub as part of the video course. Although deeplearning4j offers all the flexibility to develop new models or borrow existing ones, its face recognition support has some known issues, and it does not yet offer pre-trained weights for transfer learning.
The Code
So in order to build the Java application, we will need to use the weights from the existing Keras OpenFace model found in its GitHub repository.
- As a first step we need to build the neural network architecture, which is based on Inception networks (first introduced by GoogLeNet; detailed information can be found here). The full implementation code is not shown here, as it is simple but long:
buildBlock3a(graph);
buildBlock3b(graph);
buildBlock3c(graph);
buildBlock4a(graph);
buildBlock4e(graph);
buildBlock5a(graph);
buildBlock5b(graph);

graph.addLayer("avgpool",
        new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.AVG, new int[]{3, 3}, new int[]{1, 1})
                .convolutionMode(ConvolutionMode.Truncate)
                .build(), "inception_5b")
        .addLayer("dense",
                new DenseLayer.Builder().nIn(736).nOut(encodings)
                        .activation(Activation.IDENTITY).build(), "avgpool")
        .addVertex("encodings", new L2NormalizeVertex(new int[]{}, 1e-12), "dense")
        .setInputTypes(InputType.convolutional(96, 96, inputShape[0]))
        .pretrain(true);

/* Uncomment in case of training the network; graph.setOutputs should then be "lossLayer":
.addLayer("lossLayer", new CenterLossOutputLayer.Builder()
        .lossFunction(LossFunctions.LossFunction.SQUARED_LOSS)
        .activation(Activation.SOFTMAX).nIn(128).nOut(numClasses).lambda(1e-4).alpha(0.9)
        .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer).build(), "embeddings")
*/
graph.setOutputs("encodings");
- Training face recognition neural networks is especially computationally expensive and at the same time requires quite a lot of effort due to the careful triplet selection. Thanks to transfer learning, we can use already-trained neural network weights, even from other languages and frameworks, and benefit from all the face recognition knowledge those networks gained during training. The weights are read from CSV files originally found at the Keras OpenFace project and then copied into the Java project. The full code can be found in this class in the GitHub repository (loadWeights). Some effort is needed to adapt the weights from Keras to deeplearning4j's internal organization of convolutional and dense layers: notice how for a dense layer we first need the weights (w) and then the bias (b), while for a convolutional layer it is the other way around.
static void loadWeights(ComputationGraph computationGraph) throws IOException {
    Layer[] layers = computationGraph.getLayers();
    for (Layer layer : layers) {
        List<double[]> all = new ArrayList<>();
        String layerName = layer.conf().getLayer().getLayerName();
        if (layerName.contains("bn")) {
            all.add(readWightsValues(BASE + layerName + "_w.csv"));
            all.add(readWightsValues(BASE + layerName + "_b.csv"));
            all.add(readWightsValues(BASE + layerName + "_m.csv"));
            all.add(readWightsValues(BASE + layerName + "_v.csv"));
            layer.setParams(mergeAll(all));
        } else if (layerName.contains("conv")) {
            all.add(readWightsValues(BASE + layerName + "_b.csv"));
            all.add(readWightsValues(BASE + layerName + "_w.csv"));
            layer.setParams(mergeAll(all));
        } else if (layerName.contains("dense")) {
            double[] w = readWightsValues(BASE + layerName + "_w.csv");
            all.add(w);
            double[] b = readWightsValues(BASE + layerName + "_b.csv");
            all.add(b);
            layer.setParams(mergeAll(all));
        }
    }
}
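Putting the two pieces together, usage looks roughly like this (buildModel is an illustrative name for the graph-building code above, not the actual entry point; see the repository for that):

// Build the Inception-style graph and load the transferred OpenFace weights
ComputationGraph model = buildModel(); // illustrative; wraps the architecture code above
model.init();
loadWeights(model);
// Each preprocessed 96x96 face image now maps to a 128-dimensional,
// L2-normalized encoding that can be compared with the similarity function
INDArray encoding = model.feedForward(faceImage, false).get("encodings");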
These are basically the main parts of the application, apart from the Java Swing GUI and other low-level utilities, which can be freely explored in the code.
Running the Application & Showcase
It is possible to run the application from source by simply executing the RunFaceRecognition class. After running the application, a Java GUI will be shown as below:
You can register your own images (Register New Member button), which will be shown as a member below, and then check whether another picture of the new member (Choose Face Image) matches or not (Who Is? button).
Limitations
In the future, some consolidation may be needed in the way we load the weights. Right now the model may still need some tuning, so please stay tuned, as the code will be continually improved toward state-of-the-art accuracy.
Please note that the OpenFace model is quite small compared to real systems, so the accuracy may not be the best, but it is quite promising and it clearly shows the power of the concepts explored in this post.
ENJOY :)!