YOLO – ramok.tech

Are you a Java developer eager to learn more about Deep Learning and its applications, but you do not feel like learning another language at the moment? Are you facing a lack of support or confusion around Machine Learning and Java?

Well, you are not alone. As a Java developer with more than 10 years of experience and several Java certifications, I understand the obstacles and how you feel.

From my experience I know what obstacles a Java software engineer faces with Deep Learning, so I can be of great help to you in making the journey with deep learning an exciting experience.

In this post we are going to build a real-time video object detection application in Java for car detection, a key component in autonomous driving systems. In the previous post we were able to build an image classifier (cats & dogs), while now we are also going to detect the objects (cars, pedestrians) and additionally mark them with bounding boxes (rectangles). Feel free to download the code or run the application with your own videos (short live example).

Object Detection Nature

Object Classification

First we have the problem of Object Classification, in which we have an image and want to know whether it contains any of a set of particular categories, for example: the image contains a car vs. the image doesn’t contain any car.

We saw how to build an image classifier in the previous post using an existing architecture like VGG-16 and transfer learning.

Object Localization

Now that we can say with high confidence whether an image has a particular object or not, the next challenge is to localize the object's position in the image. Usually this is done by marking the object with a rectangle, or bounding box.

Besides classifying the image, we need to additionally specify the position of the object in the image. This is done by defining a bounding box.

A bounding box is usually represented by the center (b^x, b^y), the rectangle height (b^h) and the rectangle width (b^w).
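
As a minimal sketch of this representation (the class and method names are purely illustrative and not taken from the application), a box can be kept in center form and turned into corner coordinates whenever it needs to be drawn:

```java
// Illustrative only: a bounding box stored as center (b^x, b^y), width b^w and height b^h.
final class BoundingBox {
    final double centerX; // b^x
    final double centerY; // b^y
    final double width;   // b^w
    final double height;  // b^h

    BoundingBox(double centerX, double centerY, double width, double height) {
        this.centerX = centerX;
        this.centerY = centerY;
        this.width = width;
        this.height = height;
    }

    // Corner coordinates derived from the center representation.
    double left()   { return centerX - width / 2; }
    double top()    { return centerY - height / 2; }
    double right()  { return centerX + width / 2; }
    double bottom() { return centerY + height / 2; }
}
```

For a box with center (150, 80), width 30 and height 15 this gives the corners (135, 72.5) and (165, 87.5).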

Now we need to define these four variables in our training data for each of the objects in the images. Then the network will output not only the probability for each image class (20% cat[1], 70% dog[2], 10% tiger[3]) but also the four variables defining the bounding box of the object.

Notice that just by providing the bounding box points (center, width, height), our model is now able to predict more information, giving us a more detailed view of the content.

It is not hard to imagine that adding more points to the image can provide even greater insight into it. For example, adding the positions of more points on a human face (mouth, eyes) can tell us whether the person is smiling, crying, angry or happy.

Object Detection

We can go a step further by localizing not only one object but multiple objects in the image, and this leads us to the Object Detection problem.

Although the structure does not change much, the problem becomes harder because we need more data preparation (multiple bounding boxes per image).

In principle this problem is solved by dividing the image into smaller rectangles, and for each rectangle we have the four variables we already saw, (b^x, b^y), b^h, b^w, one additional variable P^c (explained later), and of course the normal prediction probabilities (20% cat[1], 70% dog[2]). We will see more details on this later, but for now let's say the structure is the same as in Object Localization, with the difference that we have multiples of that structure.

Sliding Window Solution

This is a quite intuitive solution that one can come up with on their own. The idea is that instead of using general car images, we crop as much as possible so that the image contains only the car.

Then we train a Convolutional Neural Network similar to VGG-16, or any other deep network, with the cropped images.

This works quite well, with the exception that the model was now trained to detect images that contain only cars, so it will have trouble with real images since they contain more than just a car (trees, people, signs…).

Real images are also much bigger than the cropped images and can contain many cars.

To overcome those problems we can analyse only a small part of the image and try to figure out if that part has a car or not. More precisely, we can scan the image with a sliding rectangular window and each time let our model predict whether there is a car in it or not. Let's see an example:

To summarize, we use a normal Convolutional Neural Network (VGG-16) to train the model with cropped images of different sizes and then use these rectangle sizes to scan the images for the objects (cars). As we can see, by using different rectangle sizes we can detect cars of quite different shapes and positions.

This algorithm is not that complex and it works. However, it has two downsides:

1. The first is low performance. We need to ask the model for a prediction many times: each time the rectangle moves, we need to execute the model in order to get a prediction.
And this is not all; we need to do it all over again for each rectangle size as well. One way to address the performance may be to increase the stride of the rectangles (bigger steps), but then we may fail to detect some of the objects.
In the past, models used to be mostly linear with features designed by hand, so prediction was not that expensive and this algorithm used to do just fine. With today's network sizes (138 million parameters for VGG-16) this algorithm is slow and of almost no use for real-time video object detection such as autonomous driving.
Another way to see why the algorithm is not very efficient is to notice that when the rectangle moves (right and down), a lot of shared pixels are not reused (the pixels where the colors get mixed) but are simply re-calculated all over again. In the next section we will see a state-of-the-art technique that overcomes this problem by using convolution.

2. Even with different bounding box sizes we may fail to mark the object precisely with a bounding box. The model may not output very accurate bounding boxes; for example, the box may include only a part of the object. The next sections will explore the YOLO (You Only Look Once) algorithm, which solves this problem for us.
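
To make the performance downside more concrete, here is a minimal sketch that counts how many times the model has to be executed for a given image size, set of square window sizes and stride. The class, the method names and the example numbers are assumptions chosen for illustration only:

```java
// Illustrative only: counts the model executions a plain sliding window needs.
final class SlidingWindowCost {

    // Number of window positions along one axis (0 if the window does not fit).
    static int positions(int imageSize, int windowSize, int stride) {
        if (windowSize > imageSize) {
            return 0;
        }
        return (imageSize - windowSize) / stride + 1;
    }

    // One model execution per window position, repeated for every square window size.
    static long modelExecutions(int imageWidth, int imageHeight, int[] windowSizes, int stride) {
        long total = 0;
        for (int size : windowSizes) {
            total += (long) positions(imageWidth, size, stride) * positions(imageHeight, size, stride);
        }
        return total;
    }
}
```

For a hypothetical 1280 X 720 frame scanned with windows of 64, 128 and 256 pixels and a stride of 16, modelExecutions returns 7958, so the full network would have to run almost eight thousand times for a single frame.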

Convolutional Sliding Window Solution

We saw that Sliding Window had performance problems because it didn’t reuse many of the already computed values. Each time the window was moved, we had to execute the whole model (millions of parameters) for all pixels in order to get a prediction. In reality, most of that computation can be reused by introducing convolution.

Turn Fully Connected Layers Into Convolution

We saw in the previous post that image classification architectures, regardless of their size and configuration, in the end feed a fully connected neural network with a varying number of layers and output several predictions depending on the classes.

For simplicity we will take a smaller network model as an example, but the same logic is valid for any convolutional network (VGG-16, AlexNet).

For a more detailed explanation of convolution and its intuition, please have a look at one of my previous posts. This simple network takes as input a color image (RGB) of size 32 X 32 X 3. It uses a Same Convolution (this convolution leaves the first two dimensions, width X height, unchanged) 3 X 3 X 64 to get an output of 32 X 32 X 64 (notice how the third dimension is the same as that of the convolution matrix, 64, and is usually larger than the input's). It uses a Max Pooling layer to reduce the width and height while leaving the third dimension unchanged: 16 X 16 X 64. After that we feed a Fully Connected neural network with two hidden layers of 256 and 128 neurons respectively. In the end we output probabilities (usually using soft-max) for 5 classes.

Now let's see how we can replace the Fully Connected layers with Convolution layers while leaving the mathematical effect the same (a linear function of the input 16 X 16 X 64).

So what we did is just replace the Fully Connected layers with convolution filters. In reality the 16 X 16 X 256 convolution filter is a 16 X 16 X 64 X 256 matrix (why? because there are multiple filters). The third dimension, 64, is always the same as the third dimension of the input 16 X 16 X 64, so for the sake of simplicity it is referred to as 16 X 16 X 256 by omitting the 64. This means that it is actually equivalent to a Fully Connected layer because:
out:1 X 1 X 256 = in:[16 X 16 X 64] * conv:[16 X 16 X 64 X 256]
so every element of the output 1 X 1 X 256 is a linear function of every element of the input 16 X 16 X 64.
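
The equivalence can be checked on a small scale. The sketch below is illustrative only (plain arrays, no biases or activation functions, hypothetical names): a fully connected layer on the flattened input and a convolution whose filters are as large as the input compute exactly the same weighted sums.

```java
// Illustrative only: a fully connected layer vs. a convolution with input-sized filters.
final class FcAsConvolution {

    // Fully connected layer: each output is a weighted sum of the flattened input.
    static double[] fullyConnected(double[] flatInput, double[][] weights) {
        double[] out = new double[weights.length];
        for (int o = 0; o < weights.length; o++) {
            for (int i = 0; i < flatInput.length; i++) {
                out[o] += weights[o][i] * flatInput[i];
            }
        }
        return out;
    }

    // Convolution whose filters cover the whole input, so each filter has a single position
    // and yields one value: the output is 1 X 1 X (number of filters).
    static double[] fullSizeConvolution(double[][][] input, double[][][][] filters) {
        double[] out = new double[filters.length];
        for (int f = 0; f < filters.length; f++) {
            for (int y = 0; y < input.length; y++) {
                for (int x = 0; x < input[y].length; x++) {
                    for (int c = 0; c < input[y][x].length; c++) {
                        out[f] += filters[f][y][x][c] * input[y][x][c];
                    }
                }
            }
        }
        return out;
    }

    // Number of weights in such a filter bank (biases ignored).
    static long weightCount(int height, int width, int channels, int filterCount) {
        return (long) height * width * channels * filterCount;
    }
}
```

For the network above, weightCount(16, 16, 64, 256) gives 4,194,304, which is exactly the number of weights of a fully connected layer mapping 16 X 16 X 64 = 16,384 inputs to 256 neurons.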

The reason we bothered to convert the Fully Connected (FC) layers to Convolution layers is that this gives us more flexibility in the way the output is produced. With FC you will always have the same output size, which is the number of classes.

Convolution Sliding Window

To see the great idea behind replacing FC with convolution filters, we need to take an input image that is bigger than the original one of 32 X 32 X 3. So let's take an input image of 36 X 36 X 3:

So this image has 4 columns and 4 rows more (green, 36 X 36) than the original image (blue, 32 X 32). If we used a sliding window with Stride=2 and FC, we would need to move a window of the original image size to 9 positions to cover everything (e.g. 3 moves are shown with the black rectangle) and therefore execute the model 9 times as well.

Let's now try to apply this new, bigger matrix as input to our new model with Convolution layers only.

Now, as we can see, the output changed from 1 X 1 X 5 to 3 X 3 X 5, whereas with FC the output would always be 1 X 1 X 5. Recall that in the case of Sliding Window we had to move the window 9 times to cover the whole image. Wait, isn’t that equal to the 3 X 3 output of the convolution? Yes indeed, each of those 3 X 3 cells represents one Sliding Window's 1 X 1 X 5 class probability prediction! So instead of having only one 1 X 1 X 5 output, now in one shot we have 3 X 3 X 5, that is all 9 combinations, without needing to move and execute the sliding window 9 times.

This is a really state-of-the-art technique, as we were able to get all 9 results in just one pass, without needing to execute a model with millions of parameters several times.
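
The 1 X 1 and 3 X 3 output sizes can be reproduced with a little arithmetic. The sketch below follows the small example network and assumes a 2 X 2 max pooling with stride 2 (implied by 32 becoming 16) and that the layers after the first 16 X 16 filter become 1 X 1 convolutions; the names are hypothetical:

```java
// Illustrative only: spatial output size of the all-convolutional example network.
final class ConvOutputSize {

    // A "same" convolution keeps width and height unchanged.
    static int sameConvolution(int size) {
        return size;
    }

    // Max pooling with equal pool size and stride shrinks the size by that factor.
    static int maxPooling(int size, int poolSize) {
        return size / poolSize;
    }

    // A convolution without padding ("valid") and stride 1.
    static int validConvolution(int size, int kernelSize) {
        return size - kernelSize + 1;
    }

    // Width (= height) of the final output grid for a square input image.
    static int outputGridSize(int inputSize) {
        int afterConv = sameConvolution(inputSize);       // 3 X 3 same convolution
        int afterPool = maxPooling(afterConv, 2);         // assumed 2 X 2 max pooling
        int afterFirstFc = validConvolution(afterPool, 16); // 16 X 16 filters replacing the first FC layer
        return validConvolution(afterFirstFc, 1);         // assumed 1 X 1 convolutions keep the size
    }
}
```

outputGridSize(32) returns 1 and outputGridSize(36) returns 3, matching the 1 X 1 X 5 and 3 X 3 X 5 outputs above.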

YOLO (You Only Look Once)

Although we already addressed the performance problem by introducing the convolutional sliding window, our model still may not output very accurate bounding boxes, even with several bounding box sizes. Let's see how YOLO solves that problem as well.

First we go through each image and mark the objects we want to detect. Each object is marked by a bounding box with four variables: the center of the object (b^x, b^y), the rectangle height (b^h) and the rectangle width (b^w). After that each image is split into a number of smaller rectangles (boxes), usually 13 X 13 rectangles, but here for simplicity it is 8 X 9.

The bounding box (red) and the object can be part of several boxes (blue), so we assign the object and the bounding box only to the box owning the center of the object (yellow boxes). So we train our model with four additional variables (b^x, b^y, b^h, b^w), besides telling it that the object is a car, and assign those to the box owning the center b^x, b^y. Since the neural network is trained with this labeled data, it also predicts the values of these four variables (besides what the object is), that is, the bounding boxes.
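
Finding the box that owns the center is a simple division. A minimal sketch, assuming pixel coordinates with the origin in the upper left corner (the class and method names are hypothetical):

```java
// Illustrative only: finds the grid box that owns the center of an object.
final class GridAssignment {

    // Returns {row, column} of the grid box containing the point (centerX, centerY).
    static int[] owningBox(double centerX, double centerY,
                           int imageWidth, int imageHeight,
                           int gridColumns, int gridRows) {
        int column = (int) (centerX * gridColumns / imageWidth);
        int row = (int) (centerY * gridRows / imageHeight);
        // A center lying exactly on the right or bottom image edge belongs to the last box.
        column = Math.min(column, gridColumns - 1);
        row = Math.min(row, gridRows - 1);
        return new int[] {row, column};
    }
}
```

For a 300 X 400 image split into 13 X 13 boxes, an object centered at (150, 80) is assigned to row 2, column 6 (counting from 0).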

Instead of scanning with predefined bounding box sizes and trying to fit the object, we let the model learn how to mark objects with bounding boxes, so the bounding boxes are now flexible (they are learned). This way the accuracy of the bounding boxes is much higher.

Let's see how we can represent the output now that we have 4 additional variables (b^x, b^y, b^h, b^w) besides classes like 1-car, 2-pedestrian… In reality another variable, P^c, is also added, which simply tells whether the image has any of the objects we want to detect at all.

P^c=1(red) means there is at least one of the objects, so it is worth looking at the probabilities and the bounding box.
P^c=0(red) means the image has none of the objects we want, so we do not care about the probabilities or the bounding box.

Bounding Box Specification

We need to label our training data in a specific way for the YOLO algorithm to work correctly. The YOLO V2 format requires the bounding box dimensions b^x, b^y and b^h, b^w to be relative to the original image width and height. Let's suppose we have an image of 300 X 400 (width X height) and the bounding box dimensions are B^width=30, B^height=15, B^x=150, B^y=80. This has to be transformed to:
B^w=30/300, B^h=15/400, B^x=150/300, B^y=80/400

This post shows how to label data using BBox Label Tool with less pain. The tool labels bounding boxes a bit differently (it gives the upper left point and the lower right point) from the YOLO V2 format, but converting is a fairly straightforward task.
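
A minimal sketch of that conversion, assuming the two corner points are given in pixels of the original image (the class and method names are hypothetical and not part of BBox Label Tool or deeplearning4j):

```java
// Illustrative only: converts a corner-based label into values relative to the image size.
final class CornerBoxConverter {

    // Input: upper left (x1, y1) and lower right (x2, y2) corners in pixels.
    // Output: {b^x, b^y, b^w, b^h}, each divided by the image width or height.
    static double[] toRelativeCenterFormat(double x1, double y1, double x2, double y2,
                                           int imageWidth, int imageHeight) {
        double centerX = (x1 + x2) / 2;
        double centerY = (y1 + y2) / 2;
        double boxWidth = x2 - x1;
        double boxHeight = y2 - y1;
        return new double[] {
            centerX / imageWidth,
            centerY / imageHeight,
            boxWidth / imageWidth,
            boxHeight / imageHeight
        };
    }
}
```

With the corners (135, 72.5) and (165, 87.5) in a 300 X 400 image this yields 0.5, 0.2, 0.1 and 0.0375, the same values as 150/300, 80/400, 30/300 and 15/400 from the example above.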

Regardless of how YOLO requires the training data to be labeled, internally the prediction is done a bit differently.

A bounding box predicted by YOLO is defined relative to the box that owns the center of the object (yellow). The upper left corner of the box starts at (0,0) and the bottom right is at (1,1). So the center (b^x, b^y) is guaranteed to be in the range 0-1 (a sigmoid function makes sure of that), since the point is inside the box. b^h and b^w, on the other hand, are calculated in proportion to the w and h values of the box (yellow), so their values can be greater than 1 (an exponential is used to keep them positive). In the picture we can see that the width b^w of the bounding box is almost 1.8 times the box width w. Similarly, b^h is approximately 1.6 times the box height h.
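
The role of the sigmoid and the exponential can be sketched as follows. This is an illustration under assumptions, not the deeplearning4j implementation: the raw values stand for the unbounded network outputs of one box, and priorWidth and priorHeight stand for a reference size measured in grid box units (in the YOLO V2 paper the exponential is multiplied by the anchor box size; 1 here means "same size as the grid box").

```java
// Illustrative only: turns raw network outputs into a box relative to the owning grid box.
final class YoloBoxDecoder {

    // Squashes any real number into the range 0-1.
    static double sigmoid(double value) {
        return 1.0 / (1.0 + Math.exp(-value));
    }

    // Returns {b^x, b^y, b^w, b^h} measured in units of one grid box.
    static double[] decode(double rawX, double rawY, double rawWidth, double rawHeight,
                           double priorWidth, double priorHeight) {
        double centerX = sigmoid(rawX);                   // always inside the box: 0-1
        double centerY = sigmoid(rawY);                   // always inside the box: 0-1
        double width = priorWidth * Math.exp(rawWidth);   // always positive, may exceed 1
        double height = priorHeight * Math.exp(rawHeight); // always positive, may exceed 1
        return new double[] {centerX, centerY, width, height};
    }
}
```

With all raw values at 0 and a reference size of 1, the decoded box sits in the middle of the grid box (0.5, 0.5) and has exactly its size (1, 1); larger raw width and height values give boxes such as the 1.8 by 1.6 one in the picture.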

After the prediction we check how much the predicted box intersects with the real bounding box labeled at the beginning. Basically we try to maximize the intersection between them, so that ideally the predicted bounding box fully overlaps the labeled bounding box.
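
The usual way to put a number on this overlap is intersection over union (IoU): the area shared by both boxes divided by the area covered by either of them. A minimal sketch, assuming both boxes are given as {centerX, centerY, width, height} in the same units (the names are hypothetical):

```java
// Illustrative only: intersection over union of two boxes in center format.
final class IntersectionOverUnion {

    // Each box is {centerX, centerY, width, height}. Returns a value between 0 and 1.
    static double iou(double[] a, double[] b) {
        double left = Math.max(a[0] - a[2] / 2, b[0] - b[2] / 2);
        double right = Math.min(a[0] + a[2] / 2, b[0] + b[2] / 2);
        double top = Math.max(a[1] - a[3] / 2, b[1] - b[3] / 2);
        double bottom = Math.min(a[1] + a[3] / 2, b[1] + b[3] / 2);

        // No overlap along an axis gives a zero-sized intersection.
        double intersection = Math.max(0, right - left) * Math.max(0, bottom - top);
        double union = a[2] * a[3] + b[2] * b[3] - intersection;
        return union <= 0 ? 0 : intersection / union;
    }
}
```

Identical boxes give 1, boxes that do not touch give 0, and two 2 X 2 boxes shifted by 1 along one axis give 2 / 6, about 0.33.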

In principle that’s it! You provide more specialized labeled data with bounding boxes (b^x, b^y, b^h, b^w), split the image and assign each object to the box containing its center (the only one responsible for detecting the object), train with the ‘Convolution Sliding Window Network’ and predict the object and its position.

Two More Problems

Although we are not going to give a very detailed explanation in this post, in reality there are two more small problems to solve:

1. Even if at training time the object is assigned to one box (the one containing the object center), at test time (when predicting) it may happen that several boxes (yellow) think they have the center of the object (red) and therefore define their own bounding boxes for the same object. This is solved by the Non-max Suppression algorithm. Currently deeplearning4j does not provide an implementation, so please find a simple implementation on GitHub (removeObjectsIntersectingWithMax). What it does is first choose as a prediction the box with the maximum P^c probability (yes, it does not only take the values 1 or 0 but rather values in the 0-1 range). Then every box that intersects with that box above a certain threshold is removed. The same logic is repeated until there are no more bounding boxes left.
2. Since we are predicting multiple objects (cars, pedestrians, traffic lights), it may happen that the centers of two or more objects fall in one box. These cases are solved by introducing Anchor Boxes. With anchor boxes we choose several bounding box shapes that we find most common for the objects we want to detect. The YOLO V2 paper does this with the K-Means algorithm, but it can also be done manually. After that we modify the output to contain the same structure we saw previously (P^c, b^x, b^y, b^h, b^w, C1, C2…), but once for each of the chosen anchor box shapes. So we may have something like this now:
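
A minimal sketch of such an output layout, assuming the values of all anchor boxes of one grid box are simply stored one after another (real frameworks may order them differently, and the class and method names are hypothetical):

```java
// Illustrative only: size and indexing of the per-grid-box output with anchor boxes.
final class YoloOutputLayout {

    // P^c, b^x, b^y, b^h, b^w plus one probability per class.
    static int valuesPerAnchor(int classCount) {
        return 5 + classCount;
    }

    // The same structure is repeated once per anchor box.
    static int valuesPerGridBox(int anchorCount, int classCount) {
        return anchorCount * valuesPerAnchor(classCount);
    }

    // Position of a value inside the grid box output:
    // valueIndex 0 = P^c, 1 = b^x, 2 = b^y, 3 = b^h, 4 = b^w, 5.. = C1, C2, ...
    static int offset(int anchorIndex, int valueIndex, int classCount) {
        return anchorIndex * valuesPerAnchor(classCount) + valueIndex;
    }
}
```

With 2 anchor boxes and 3 classes each grid box carries 2 x (5 + 3) = 16 values, and the P^c of the second anchor box sits at offset 8. With 5 anchor boxes and the 20 PASCAL VOC classes it would be 5 x (5 + 20) = 125 values per grid box.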

Application

Training deep networks takes a lot of effort and requires significant data and processing power. So, as we did in the previous post, we will use transfer learning. This time we are not going to modify the architecture and train with different data, but rather use the network directly.

We are going to use Tiny YOLO; citing from the site:

Tiny YOLO is based off of the Darknet reference network and is much faster but less accurate than the normal YOLO model. To use the version trained on VOC:
wget https://pjreddie.com/media/files/tiny-yolo-voc.weights
./darknet detector test cfg/voc.data cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights data/dog.jpg
Which, ok, it’s not perfect, but boy it sure is fast. On GPU it runs at >200 FPS.

The current release version of deeplearning4j, 0.9.1, does not offer TinyYOLO, but 0.9.2-SNAPSHOT does. So first we need to tell Maven to load the SNAPSHOT:

<repositories>
    <repository>
        <id>a</id>
        <url>http://repo1.maven.org/maven2/</url>
    </repository>
    <repository>
        <id>snapshots-repo</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <releases>
            <enabled>false</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>daily</updatePolicy>
        </snapshots>
    </repository>
</repositories>

<dependencies>

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${deeplearning4j}</version>
    </dependency>

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${deeplearning4j}</version>
    </dependency>

Then we are ready to load the model with fairly short code:

private TinyYoloPrediction() {
    try {
        preTrained = (ComputationGraph) new TinyYOLO().initPretrained();
        prepareLabels();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

prepareLabels just uses the labels from the PASCAL VOC dataset that was used to train the model. Feel free to run preTrained.summary() to see the model architecture details.

The video frames are captured using JavaCV with CarVideoDetection:

FFmpegFrameGrabber grabber;
grabber = new FFmpegFrameGrabber(f);
grabber.start();
while (!stop) {
    videoFrame[0] = grabber.grab();
    if (videoFrame[0] == null) {
        stop();
        break;
    }
    v[0] = new OpenCVFrameConverter.ToMat().convert(videoFrame[0]);
    if (v[0] == null) {
        continue;
    }
    if (winname == null) {
        winname = AUTONOMOUS_DRIVING_RAMOK_TECH + ThreadLocalRandom.current().nextInt();
    }

    if (thread == null) {
        thread = new Thread(() -> {
            while (videoFrame[0] != null && !stop) {
                try {
                    TinyYoloPrediction.getINSTANCE().markWithBoundingBox(v[0], videoFrame[0].imageWidth, videoFrame[0].imageHeight, true, winname);
                } catch (java.lang.Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
        thread.start();
    }

    TinyYoloPrediction.getINSTANCE().markWithBoundingBox(v[0], videoFrame[0].imageWidth, videoFrame[0].imageHeight, false, winname);

    imshow(winname, v[0]);
}

So what the code does is get frames from the video and pass them to the pre-trained TinyYOLO model. There the image frame is first scaled to 416 X 416 X 3 (RGB) and then given to TinyYOLO for predicting and marking the bounding boxes:

public void markWithBoundingBox(Mat file, int imageWidth, int imageHeight, boolean newBoundingBOx,String winName) throws Exception {
    int width = 416;
    int height = 416;
    int gridWidth = 13;
    int gridHeight = 13;
    double detectionThreshold = 0.5;

    Yolo2OutputLayer outputLayer = (Yolo2OutputLayer) preTrained.getOutputLayer(0);
   
        INDArray indArray = prepareImage(file, width, height);
        INDArray results = preTrained.outputSingle(indArray);
        predictedObjects = outputLayer.getPredictedObjects(results, detectionThreshold);
        System.out.println("results = " + predictedObjects);
        markWithBoundingBox(file, gridWidth, gridHeight, imageWidth, imageHeight);
    imshow(winName, file);
}

After the prediction we should have the predicted values of the bounding box dimensions ready. We have also implemented Non-Max Suppression (removeObjectsIntersectingWithMax) because, as we mentioned, at test time YOLO predicts more than one bounding box per object. Rather than using b^x, b^y, b^h, b^w we will use the topLeft and bottomRight points. gridWidth and gridHeight are the number of small boxes we split our image into, which in our case is 13 X 13. w and h are the original image frame dimensions.

private void markWithBoundingBox(Mat file, int gridWidth, int gridHeight, int w, int h, DetectedObject obj) {

    double[] xy1 = obj.getTopLeftXY();
    double[] xy2 = obj.getBottomRightXY();
    int predictedClass = obj.getPredictedClass();
    int x1 = (int) Math.round(w * xy1[0] / gridWidth);
    int y1 = (int) Math.round(h * xy1[1] / gridHeight);
    int x2 = (int) Math.round(w * xy2[0] / gridWidth);
    int y2 = (int) Math.round(h * xy2[1] / gridHeight);
    rectangle(file, new Point(x1, y1), new Point(x2, y2), Scalar.RED);
    putText(file, map.get(predictedClass), new Point(x1 + 2, y2 - 2), FONT_HERSHEY_DUPLEX, 1, Scalar.GREEN);
}
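
The non-max suppression step mentioned above is not listed in this post. A minimal sketch of the idea follows; the Detection class is a hypothetical stand-in rather than deeplearning4j's DetectedObject, the threshold is an assumed parameter, and the project's own removeObjectsIntersectingWithMax may differ in its details.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: keeps the most confident box and drops boxes overlapping it too much.
final class NonMaxSuppression {

    // Hypothetical detection: top left (x1, y1), bottom right (x2, y2) and confidence P^c.
    static final class Detection {
        final double x1, y1, x2, y2, confidence;

        Detection(double x1, double y1, double x2, double y2, double confidence) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
            this.confidence = confidence;
        }

        double area() {
            return (x2 - x1) * (y2 - y1);
        }
    }

    // Intersection over union of two corner-based boxes.
    static double iou(Detection a, Detection b) {
        double width = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
        double height = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
        double intersection = width * height;
        double union = a.area() + b.area() - intersection;
        return union <= 0 ? 0 : intersection / union;
    }

    // Repeatedly picks the most confident remaining box and removes every box
    // whose overlap with it exceeds the threshold, until no boxes are left.
    static List<Detection> suppress(List<Detection> detections, double iouThreshold) {
        List<Detection> remaining = new ArrayList<>(detections);
        remaining.sort(Comparator.comparingDouble((Detection d) -> d.confidence).reversed());
        List<Detection> kept = new ArrayList<>();
        while (!remaining.isEmpty()) {
            Detection best = remaining.remove(0);
            kept.add(best);
            remaining.removeIf(other -> iou(best, other) > iouThreshold);
        }
        return kept;
    }
}
```

This sketch ignores the predicted class; applying it separately per class is a common variation when objects of different classes overlap.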

After that, using another thread (besides the one playing the video), we update the video to have the rectangle and the label for the detected object.

The prediction is really fast (real-time) considering that it runs on a CPU; on GPUs we will have even better real-time detection.

Running Application

The application can be downloaded and executed without any knowledge of Java, although Java has to be installed on your computer. Feel free to try it with your own videos.

It is possible to run it from source by simply executing the RUN class, or, if you do not feel like opening it with an IDE, just run mvn clean install exec:java.

After running the application you should be able to see the view below: