Java Object Tracking for Cars

In this post we are going to develop a Java application for tracking cars in a video using Deeplearning4j. Considering the accuracy achieved in the new era of deep learning, tasks such as image recognition or even object detection are considered solved problems. Because of that, a lot of attention and effort is directed towards more difficult problems, like the fascinating problem of object tracking. Please feel free to check out the code on GitHub as part of Java Machine Learning for Computer Vision; additionally, please find a video sample of the running application.

Object tracking is a quite challenging problem, so we will also address quite a few deep learning challenges along the way: training a neural network almost from scratch, performance (making it possible to run on a CPU), accuracy, testing, managing data sets, and we will see how several intuitions can actually produce poor results.

Object Tracking Problem

At a high level, object tracking means marking an object with a bounding box and then following this object as it moves through the video by continuously marking it with a bounding box. In this post we will track multiple objects of one type: cars.

When talking about object tracking, the first thing that changes is that our data are now videos (we track movement, after all) rather than pictures. Still, videos are composed of many frames (pictures) which are shown on the display fast enough for us to perceive them as a 'video', or moving pictures. So let's try to define the problem a bit more clearly:

  • For the video frame currently shown on the display we need to mark all the cars with bounding boxes (object detection) and assign an ID to each of them

  • When the next frame is shown on the display, beside marking all the cars with bounding boxes, we need to recognize that these new bounding boxes actually point to the same cars as in the previous frame and assign the same IDs

  • It can happen that some of the cars in the next frame were not in the previous frame, so we need to assign them new IDs and, of course, mark them with bounding boxes

Object Detection

As we saw in one of the previous posts, Java Autonomous Driving - Car Detection, fortunately we already know how to solve the object detection problem by using the YOLO algorithm. So for each of the frames in the video we have already explored in detail how to mark cars with bounding boxes with impressive accuracy. The only difference is that in this post we are going to use the full YOLO model instead of the Tiny YOLO used in the previous post, so higher accuracy is achieved.

Car Identification Between Frames

Identifying that the car (marked with a bounding box) in the current frame is the same car as in the previous frame is actually the main problem, and solving it is what will enable us to track a car as it moves frame by frame. So let's explore some of the possible solutions:

Bounding Boxes Intersection Over Union

We already briefly saw how YOLO uses Intersection over Union (IoU) to solve the problem of multiple bounding boxes being predicted by the model for the same object (multiple boxes think they contain the center of the object; explained in detail in Java Machine Learning for Computer Vision, chapter 4).

Similarly, we could say that if two bounding boxes (the current frame box and the previous frame box) overlap by more than a certain percentage (0.5), then they mark the same object. As simple as this solution may sound, it works surprisingly well, especially if the cars are not moving extremely fast and the model is fast enough (using a GPU this would not be a problem).
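
As a concrete illustration, here is a minimal sketch of how such an intersection over union check could look in plain Java; the Box class and the 0.5 threshold are just assumptions for the example, the real project works directly with the bounding boxes produced by YOLO:

```java
public class IoUSketch {

    // simple axis-aligned bounding box: top-left corner plus width and height
    static class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    static double intersectionOverUnion(Box a, Box b) {
        // overlapping rectangle, if any
        double interLeft = Math.max(a.x, b.x);
        double interTop = Math.max(a.y, b.y);
        double interRight = Math.min(a.x + a.width, b.x + b.width);
        double interBottom = Math.min(a.y + a.height, b.y + b.height);

        double interArea = Math.max(0, interRight - interLeft) * Math.max(0, interBottom - interTop);
        double unionArea = a.width * a.height + b.width * b.height - interArea;
        return unionArea <= 0 ? 0 : interArea / unionArea;
    }

    static boolean probablySameObject(Box current, Box previous) {
        // treat boxes overlapping by more than 50% as the same object
        return intersectionOverUnion(current, previous) > 0.5;
    }
}
```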

Still, it has one problem which unfortunately prevents us from using it alone. We need to keep previous bounding boxes in memory in order to compare them with the newly arriving bounding boxes. Additionally, since the model may fail to predict bounding boxes for all the cars in every frame (F1 -> 4 cars originally predicted, F2 -> only 3 of the 4 cars predicted for some reason, F3 -> 4 of 4 cars, the lost car comes back as a prediction), we need to keep bounding boxes from several previous frames (F1, F2 and the current F3).

This may cause issues if a new car happens to be at the same position as a bounding box already in memory from an old car and overlaps with that bounding box by more than 0.5. In that case, since the bounding boxes overlap by more than 0.5, we will say this is the same car, but actually it is not; the new car just happened to be at the same position as a previous car's bounding box. Feel free to observe the problem in this video.

Cropping Cars from Bounding Boxes

Since we already have the bounding boxes from YOLO, one could easily cut out the part of the frame marked by each bounding box and end up with small car images. Although it may sound intuitive, comparing the pixels of these small images directly doesn't work well, because even slight changes in light intensity will cause the pixels to differ, as will small differences in the car's position (cropped a bit more to the left, right, up or down). We have already seen two possible solutions to this problem when exploring Java Neural Style Transfer and Java Face Recognition, so let's revisit them below.

Deep Convolution Layers Representation

In the neural style transfer post we already gained an intuition about what convolutional neural networks are learning. The deeper we go in the layers, the bigger the part of the image the neurons are detecting, therefore capturing high-level features of the image; in comparison, the lower layers capture rather small parts of the image, or low-level features like rectangles, colors, lines and shapes.

So, since deep convolution layers have already learned how to detect image features and how to handle light intensity and other issues, we could compare the convolution layer outputs of the images rather than the pixels directly. Using transfer learning we could easily reuse neural networks like VGG-16, which were trained on millions of images. When the distance between the convolution representations of two images is smaller than a threshold, we conclude that they show the same car.
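
Once the two feature maps have been extracted, the comparison itself boils down to a distance check. Here is a minimal sketch with ND4J, assuming the activations have already been obtained; the threshold value is purely a placeholder and would have to be tuned on real data:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class ConvRepresentationDistance {

    // hypothetical threshold; in practice it has to be tuned on real data
    private static final double SAME_CAR_THRESHOLD = 25.0;

    /**
     * Compares two convolution layer activations (same shape) and decides
     * whether they likely represent the same car.
     */
    public static boolean sameCar(INDArray currentFeatures, INDArray previousFeatures) {
        double distance = Transforms.euclideanDistance(currentFeatures, previousFeatures);
        return distance < SAME_CAR_THRESHOLD;
    }
}
```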

As we will see, this solution actually works fairly well (please have a look at this video), but not well enough. Although the network is well trained, it is very general and not specialized for moving cars. Maybe considering different convolution layers for the same image, as we did in neural style transfer, would help a bit, but it is still clear that a network more specialized for the problem at hand can do better.

Image Encoding Representations

In a way, if we think about the problem (uniquely identifying whether the car in the current frame is the same as the car in the previous frame), it is almost identical to the face recognition problem we explored in Java Home Made Face Recognition.

Recalling from that post: we are not interested in just knowing whether the image is a car or not, but additionally in finding out whether it is specifically the same car as in the previous frame (for animal classification we would need to find out if this is John's dog or Maria's dog rather than just a dog). Face verification is no different, the logic is just extended to human faces. The question is not whether it is simply a human or not, but rather whether it is the person with identification X, or the company employee with some identification number, and so on.

This means that intuitively we could use the same approach, with the difference that we need to train a neural network for car recognition rather than face recognition.

Although at first look it seems like the only difference from the convolution representation solution is that we use a neuron layer's outputs instead of the convolution layers' outputs for the comparison or distance, there is a big difference in the way the networks are trained. VGG-16 and similar networks are trained the classical way, using classes and a softmax layer, while the face recognition network is actually trained using the state-of-the-art technique of triplet loss, explained in detail in the previous post.

Triplet loss has shown great results, but unfortunately choosing the triplets is not a trivial process and is expensive. For that reason we are going to use a similar but easier approach called Center Loss, introduced originally in the paper A Discriminative Feature Learning Approach for Deep Face Recognition. In principle, center loss is a fairly simple loss function: it tries to change the neural network weights so that, at each iteration, the embedding of an image Xi comes as close as possible to the average embedding of Xi's class.
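
For reference, the joint objective from the center loss paper can be written as follows, where x_i is the embedding of sample i, c_{y_i} is the (continuously updated) center of its class y_i, m is the mini-batch size, and lambda balances the two terms:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{softmax}} \;+\; \frac{\lambda}{2} \sum_{i=1}^{m} \bigl\lVert x_i - c_{y_i} \bigr\rVert_2^2
```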

In the end we are going to train a neural network to identify cars as classes using center loss. During training, if we have 200 cars and, let's say, 10 images per car taken from video, the neural network will have to predict (using softmax) 200 classes as output, with 2000 samples in total. During testing, however, we are going to throw away the softmax layer (200 classes) and use only the embedding layer (F(X)). More details will follow in the implementation section and code.
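
To make this more concrete, here is a minimal, illustrative sketch of such a network in Deeplearning4j, with a dense embedding layer followed by a CenterLossOutputLayer. The layer sizes and hyperparameters are assumptions for the example; the actual architecture used in the project is defined in TrainCifar10Model:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.*;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class CenterLossCarNetSketch {

    public static MultiLayerNetwork build(int numCarClasses) {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(123)
                .updater(new Adam(1e-3))
                .list()
                // small convolutional base, a stand-in for the CIFAR-10-style layers
                .layer(new ConvolutionLayer.Builder(3, 3).nIn(3).nOut(32)
                        .activation(Activation.RELU).build())
                .layer(new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                        .kernelSize(2, 2).stride(2, 2).build())
                .layer(new ConvolutionLayer.Builder(3, 3).nOut(64)
                        .activation(Activation.RELU).build())
                .layer(new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                        .kernelSize(2, 2).stride(2, 2).build())
                // embedding layer F(X): this is what we keep at test time
                .layer(new DenseLayer.Builder().nOut(512)
                        .activation(Activation.RELU).build())
                // softmax + center loss head: thrown away after training
                .layer(new CenterLossOutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nOut(numCarClasses)
                        .activation(Activation.SOFTMAX)
                        .lambda(1e-4)   // weight of the center loss term
                        .alpha(0.5)     // learning rate of the class centers
                        .build())
                .setInputType(InputType.convolutional(32, 32, 3))
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}
```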

Data

The first challenge with object tracking is data, because our data are now videos rather than pictures. Additionally, because of the complexity of marking objects with bounding boxes (besides classifying them, e.g. as a car, person, ...), the amount and quality of data is quite a bit smaller compared to other problems like speech recognition or image recognition.

Less data means that we need more hacks, hand engineering and complex architectures in order to achieve good results, which is quite common among computer vision applications.

Hack Existing Data Sets

A wise thing to do when solving a deep learning problem is to look around for already existing data sets. After some investigation we found that the Multi-View Car Dataset fits quite well. Basically this data set has 20 sequences of cars as they rotate by 360 degrees, and this serves our problem well because while driving, a car meets other cars at almost all 360 degrees.

However, in practice we do not have the luxury of such good photos; most of the time we see only a part of the car, or frame by frame the view of the car is cut off, as below:

This is a solvable problem, because we can always modify each of the photos by creating new ones which are cut horizontally, vertically, or even resized, and that is exactly what the TransformRotatingImages class does:

Before modifying the images we feed them to YOLO (cropAllWithYOLOBoundingBox), which in turn produces a bounding box for the car in each image. After that, using the bounding box, we cut out only that part of the image (with the box size and location), creating a new image (which will later be modified). In this way we produce images closer to what we see in practice than the original ones, 13,811 images in total.
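
Here is a minimal sketch of this cropping step with plain Java imaging, assuming the bounding box coordinates come from YOLO; the BoundingBox fields, the clamping and the cut ratio are assumptions for the example, and the project's own logic lives in TransformRotatingImages and the YOLO cropping code:

```java
import java.awt.image.BufferedImage;

public class CropSketch {

    // minimal stand-in for a YOLO detection, in pixel coordinates
    static class BoundingBox {
        final int x, y, width, height;
        BoundingBox(int x, int y, int width, int height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /** Cuts out only the part of the frame covered by the bounding box. */
    static BufferedImage cropCar(BufferedImage frame, BoundingBox box) {
        // clamp to the frame so getSubimage never goes out of bounds
        int x = Math.max(0, box.x);
        int y = Math.max(0, box.y);
        int w = Math.min(box.width, frame.getWidth() - x);
        int h = Math.min(box.height, frame.getHeight() - y);
        return frame.getSubimage(x, y, w, h);
    }

    /** Simulates a partially visible car by keeping only the left part of the crop. */
    static BufferedImage cutHorizontally(BufferedImage carImage, double keepRatio) {
        int keptWidth = Math.max(1, (int) (carImage.getWidth() * keepRatio));
        return carImage.getSubimage(0, 0, keptWidth, carImage.getHeight());
    }
}
```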

There is also a smaller version of this data set (more on this in the training section) which keeps only one image in three, 758 images in total.

Create Data Set From Scratch

Although we were able to produce a good deal of data, in the end there are only 20 cars. Training with only 20 cars means that the neural network would see really few types of cars, and we know that neural networks (no matter how deep) have trouble with unseen data.

So we will need to produce some data on our own. One way to do that is something similar to what we did when modifying the rotating images by cropping them with YOLO. But now we are going to feed videos to YOLO and, in the same way, for each bounding box we are going to cut the frame (image) at the bounding box size and location (for more details please have a look at ProduceDataFromVideo). At the end of this process we will have a lot of car images, but unfortunately this data is unlabeled.

By unlabeled we mean that although we have the car images, there is no way out of the box to know which images show the same car. Yes, this process is manual and painful, but this is part of the deep learning life :).

Usually when it comes to data it is suggested not to invest too much effort immediately, but rather to start small and increase slowly, as long as there is actually good progress each time more data is provided. It is quite intuitive to think that more data will help, but this can be a (sometimes costly) mistake, since what we may actually need is a better architecture.

So we start with 96 classes (1400 images), then 190 classes (2200 images), and finally 611 classes (6300 images). This data set is free to use, although a reference back is appreciated when using it.

Training Network and Testing

By now we have the data and also a general idea, together with a model architecture we think may work.

Let's summarize shortly how the solution looks so far (a simplified sketch of this per-frame loop follows the list):

  • The input is going to be a video, and YOLO detects cars in each video frame in almost real time

  • For each of the YOLO-predicted bounding boxes we crop out only that part of the frame, so that we are left with just the car image

  • Each of the cropped images is given as input to a new neural network, trained with center loss to recognize cars. Then we save the embeddings produced by this network for each cropped image in memory

  • For every new frame we compare the new embeddings with the existing embeddings in memory. When the distance is less than a threshold we know that this is the same car (so the same ID is shown), otherwise this is a new car (a new ID is assigned). More specifically, we search for the minimum distance between embeddings (current frame vs. previous frames) that is less than the threshold
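
Putting the pieces together, here is a hedged, simplified sketch of that per-frame loop; Yolo, CarEmbeddingNet and Detection, as well as the bookkeeping, are hypothetical stand-ins for the project's actual classes:

```java
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class TrackingLoopSketch {

    // hypothetical collaborators; the real ones are YOLO and the center-loss CIFAR-10 model
    interface Yolo { List<Detection> detect(BufferedImage frame); }
    interface CarEmbeddingNet { INDArray embed(BufferedImage croppedCar); }
    interface Detection { BufferedImage crop(BufferedImage frame); }

    private static final double DISTANCE_THRESHOLD = 8.5; // value used in the experiments below

    private final Map<Integer, INDArray> knownEmbeddings = new HashMap<>();
    private int nextId = 0;

    void processFrame(BufferedImage frame, Yolo yolo, CarEmbeddingNet net) {
        for (Detection detection : yolo.detect(frame)) {
            INDArray embedding = net.embed(detection.crop(frame));

            // find the closest known car below the threshold
            Integer bestId = null;
            double bestDistance = DISTANCE_THRESHOLD;
            for (Map.Entry<Integer, INDArray> known : knownEmbeddings.entrySet()) {
                double d = Transforms.euclideanDistance(embedding, known.getValue());
                if (d < bestDistance) {
                    bestDistance = d;
                    bestId = known.getKey();
                }
            }

            int id = (bestId != null) ? bestId : nextId++;
            knownEmbeddings.put(id, embedding); // refresh the stored embedding
            // drawing the bounding box and the id on the frame is omitted here
        }
    }
}
```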

Training

Even without considering IO operations like image cropping and image scaling, because we are using two neural networks (YOLO, which is very deep, and the new network), the solution is computationally expensive, especially on a CPU. So we need to be careful when choosing the architecture for the new neural network.

After some investigation it turned out that the model which fits best (especially on a CPU) is the CIFAR-10 model, because it is small, with only 2,786,890 parameters, and the input images are also small, 32 x 32. Additionally we can use transfer learning, since the existing model has already been trained on cars. This will not help much, since our problem is a bit different and we will use a different loss function together with an embedding layer, but at least it solves the problem of a good network initialization.
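
To illustrate the initialization idea, here is a hypothetical sketch of reusing a CIFAR-10-style pretrained network with DL4J's transfer learning API. The file name, layer sizes (nIn values) and hyperparameters are assumptions, and the actual setup lives in TrainCifar10Model:

```java
import java.io.File;
import org.deeplearning4j.nn.conf.layers.CenterLossOutputLayer;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.transferlearning.FineTuneConfiguration;
import org.deeplearning4j.nn.transferlearning.TransferLearning;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class TransferFromCifarSketch {

    public static MultiLayerNetwork adapt(int numCarClasses) throws Exception {
        // hypothetical path to a network previously trained on CIFAR-10
        MultiLayerNetwork pretrained =
                ModelSerializer.restoreMultiLayerNetwork(new File("cifar10-pretrained.zip"));

        FineTuneConfiguration fineTune = new FineTuneConfiguration.Builder()
                .updater(new Adam(1e-4))
                .seed(123)
                .build();

        // keep the convolutional base, replace the classifier with an
        // embedding layer plus a center loss softmax head
        return new TransferLearning.Builder(pretrained)
                .fineTuneConfiguration(fineTune)
                .removeOutputLayer()
                .addLayer(new DenseLayer.Builder()
                        .nIn(1024)  // assumed output size of the last retained layer
                        .nOut(512)
                        .activation(Activation.RELU).build())
                .addLayer(new CenterLossOutputLayer.Builder(
                        LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nIn(512)
                        .nOut(numCarClasses)
                        .activation(Activation.SOFTMAX)
                        .lambda(1e-4)
                        .build())
                .build();
    }
}
```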

Testing

There is one last step before jumping into training results: the testing method has to be defined. Usually testing a model is easy, because we just have to set aside a percentage of the existing data set and use it as a development or test data set.

However, in our case testing the model is different from training it, because during training we used a softmax to predict a certain number of cars taken from videos (96 classes, 190 classes, 611 classes). Once the training is finished with satisfactory precision on the training data set, we need to test how the model does on unseen data. Since the model has not seen the cars in the test data set, we cannot test it by asking which of the training classes a test car belongs to (we already know the answer: none of them).

So we need to test the model the same way we are going to use it after YOLO predicts: by throwing away the softmax layer used during training and calculating the distance between image embeddings. There are different types of tests we will perform (a code sketch of both checks follows the list):

  1. For each of the car pictures in the same class (same car), calculate the embedding distances between them; whenever the distance is bigger than a threshold we know the model is not doing well, since this is the same car. We take the car images of one class in the sequence shown in the video and compare them sequentially, so each image's embedding is compared with the previous one's.

  2. Compare each image with all other images not in its class; whenever the embedding distance is smaller than the threshold, we know that the model has wrongly predicted two different cars as the same car.
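
Here is a hedged sketch of the two checks, assuming the embeddings have already been computed and grouped per class; the method names mirror the counters reported in the results below, everything else is a simplification:

```java
import java.util.List;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class EmbeddingTestSketch {

    /** Test 1: consecutive images of the same car should stay below the threshold. */
    static int wrongPredictionsInOneClassSequentially(List<INDArray> sameCarEmbeddings,
                                                      double threshold) {
        int wrong = 0;
        for (int i = 1; i < sameCarEmbeddings.size(); i++) {
            double d = Transforms.euclideanDistance(sameCarEmbeddings.get(i),
                    sameCarEmbeddings.get(i - 1));
            if (d > threshold) {
                wrong++; // same car, but the model thinks it is a different one
            }
        }
        return wrong;
    }

    /** Test 2: images of different cars should stay above the threshold. */
    static int wrongPredictionsWithOtherClasses(List<INDArray> classA,
                                                List<INDArray> otherClasses,
                                                double threshold) {
        int wrong = 0;
        for (INDArray a : classA) {
            for (INDArray other : otherClasses) {
                if (Transforms.euclideanDistance(a, other) < threshold) {
                    wrong++; // different cars wrongly considered the same
                }
            }
        }
        return wrong;
    }
}
```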

Results

For more details about training please refer to TrainCifar10Model. Please find below the model built with Deeplearning4j and trained for 611 classes:

The test data set is composed of 940 car images produced from videos never seen by the model during training.

Training Results 1

Dataset from videos, 96 classes, 1400 samples, Threshold = 8.5

wrongPredictionsInOneClassSequentially= 348/940, wrongPredictionsWithOtherClasses= 156/(940*940)

Training Results 2

Dataset from videos, 190 classes, 2200 samples, Threshold = 8.5

wrongPredictionsInOneClassSequentially= 269 / 940, wrongPredictionsWithOtherClasses= 290/ (940*940)

Notice how the wrong predictions for sequential images decrease from 348 to 269, although for some reason the first model does better at distinguishing other classes in general (156 vs 290). Overall, the results tell us there is a good chance that more data may help further:

Training Results 3

Dataset from videos, 611 classes, 6300 samples, Threshold = 8.5

wrongPredictionsInOneClassSequentially= 311/ 940, wrongPredictionsWithOtherClasses= 52/ (940*940)

It looks like more data helped, but maybe not as much as we expected. We are definitely doing much better at distinguishing between different cars (52 vs 290 vs 156), but somehow the model is not doing that well when it comes to detecting the same car sequentially (311 vs 269). Still, since we are doing well at distinguishing different cars, we can afford to increase the threshold to 0.9 and improve sequential detection at the cost of more wrong detections for other classes:

wrongPredictionsInOneClassSequentially= 260/ 940, wrongPredictionsWithOtherClasses= 148/ (940*940)

This is actually the best result achieved so far, and this model can be trained further, as it was trained to only 89% accuracy on the training data set, in comparison to 99% for the other models (190 and 96 classes).

Training Results 4

Dataset from rotating images, 25 classes, 13,811 samples, Threshold = 8.5

wrongPredictionsInOneClassSequentially= 146/ 940, wrongPredictionsWithOtherClasses= 15076/ (940*940)

As we can see, the rotating data set did not offer higher accuracy, especially for distinguishing between different cars. It looks like the model is not generalizing well, since after all it has seen only 25 cars, although at many more positions/angles (360 degrees).

Training Results 5

We can of course join this data set with the video data set and train the model on both. After all, the rotating images surely offer some value, since each car is captured at many angles.

Well, merging the 13,811 images from the rotating images data set with the 6300 images from the video data set actually resulted in very poor results... The reason is that the rotating images data set has many examples per car (13,811 images for only 25 cars), while the video data set has 6300 images for 611 cars. So during training the neural network focuses almost entirely on the rotating images data set and ignores the video data: since optimizing the video data set images has little impact on the cost function, the network concentrates on the rotating images data set.

For this reason we created a smaller rotating images data set with 758 images and 20 cars. Then we merged it with the video data set, getting the results below:

wrongPredictionsInOneClassSequentially= 275/ 940, wrongPredictionsWithOtherClasses= 16000/ (940*940)

Although augmenting the data usually helps, clearly the rotating cars data set is not helping to achieve better results; actually the accuracy degrades.

Final Solution

Looking at the test results, the best score is achieved by Training Result 3 (260, 148), so we will choose that model for the video car tracking application. For each of the frames we compare the embeddings of the current bounding boxes with the embeddings of previous bounding boxes already saved in memory. Only when the distance is below a threshold (0.9) do we conclude that this is the same car. Actually, in reality we are looking for the minimum distance among the embeddings that is below the threshold (0.9).

Running the application produces this video as a result. In comparison to the VGG-16 video seen previously it performs slightly better, but it can easily be seen that the results are still not good enough.

There are a lot of reasons for that, among others: not enough data (nowadays hundreds of thousands to millions of images are normal), the network is not deep enough (CIFAR-10 is really small), and center loss tuning (we kept lambda at 1e-4 all the time, but different values can increase the model's generalization).

Although getting more data and using a deeper network would most probably improve the accuracy, that is quite time consuming, so it is out of scope for this blog. Fortunately, there is a solution which, as we will see, greatly improves the accuracy.

At the beginning (Bounding Boxes Intersection Over Union) we saw how intersection over union can help us track a car, since frame by frame the bounding boxes of the same car overlap a lot (> 0.5). However, intersection over union cannot look inside the bounding box, so it can wrongly consider different cars as the same car when their bounding boxes happen to overlap by more than 0.5.

What we can do is enrich the intersection over union solution with the embeddings comparison. So when two bounding boxes overlap by more than 0.5, we do not directly mark them as the same car, but additionally measure the embeddings distance, and only when that distance is below a threshold do we mark them as the same car. Please find the video result after this change.
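
A minimal sketch of this combined check, reusing the hypothetical helpers from the earlier sketches (the class names are assumptions; the thresholds are the values used in the post):

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class IoUPlusEmbeddingsSketch {

    private static final double IOU_THRESHOLD = 0.5;
    private static final double DISTANCE_THRESHOLD = 0.9; // distance threshold of the final model

    /**
     * Two detections are considered the same car only when their boxes overlap
     * enough AND their embeddings are close enough.
     */
    static boolean sameCar(IoUSketch.Box currentBox, INDArray currentEmbedding,
                           IoUSketch.Box previousBox, INDArray previousEmbedding) {
        double iou = IoUSketch.intersectionOverUnion(currentBox, previousBox);
        if (iou <= IOU_THRESHOLD) {
            return false; // boxes too far apart, no need to compare embeddings
        }
        double distance = Transforms.euclideanDistance(currentEmbedding, previousEmbedding);
        return distance < DISTANCE_THRESHOLD;
    }
}
```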

The reason why it works better is that we are not applying the embeddings distance to every bounding box in memory, but only to the small subset whose bounding boxes overlap by more than 0.5. Besides lowering the number of distance comparisons, and therefore the error rate, this at the same time increases the chances of success, since when two bounding boxes overlap by more than 0.5, most probably they belong to the same car.

CPU Architecture Optimization

Considering our solution, we have two convolutional neural networks, the YOLO model and CIFAR-10. So even ignoring other IO operations, like cutting out the pixels inside the bounding boxes and feeding them to CIFAR-10, the solution is computationally expensive, especially on a CPU.

As a consequence, running the solution in real time on a CPU (using a GPU works fine) is not possible without some modifications, since each frame needs 1.1-1.5 seconds to be processed.

Since executing YOLO and CIFAR-10 takes a lot of time, we are going to move that processing into a separate thread, while the process of showing the video frames runs independently in yet another thread (more info in VideoPlayer).

So we have the video or main thread, which shows the frame on the display together with the already predicted bounding boxes (no network execution happens here) and additionally queues the frame into the frames queue:

Notice that we slow the main thread down a bit by using Thread.sleep(); depending on CPU power, this can help give the other (YOLO) thread some room to work.

The second thread, called the YOLO thread, picks frames from the queue and predicts bounding boxes; additionally, the bounding box pixels are fed to the other network, CIFAR-10, to get the embeddings for each of the bounding boxes.
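
A hedged sketch of this two-thread setup with a standard BlockingQueue follows; the actual implementation in the project (VideoPlayer and the YOLO thread) is richer, and the interfaces, queue size and sleep time here are hypothetical placeholders:

```java
import java.awt.image.BufferedImage;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoThreadPipelineSketch {

    interface FrameGrabber { BufferedImage nextFrame(); }
    interface Detector { List<?> detectAndEmbed(BufferedImage frame); } // YOLO + CIFAR-10

    private final BlockingQueue<BufferedImage> frames = new ArrayBlockingQueue<>(64);
    private volatile List<?> latestPredictions = Collections.emptyList();

    /** Main (video) thread: show frames with the latest known predictions, queue them for YOLO. */
    void runVideoThread(FrameGrabber grabber) throws InterruptedException {
        while (true) {
            BufferedImage frame = grabber.nextFrame();
            frames.offer(frame);              // drop the frame if the queue is full
            draw(frame, latestPredictions);   // no network execution here
            Thread.sleep(30);                 // give the YOLO thread some breathing room
        }
    }

    /** YOLO thread: the heavy network work happens here, asynchronously. */
    void runYoloThread(Detector detector) throws InterruptedException {
        while (true) {
            BufferedImage frame = frames.take();            // blocks until a frame is available
            latestPredictions = detector.detectAndEmbed(frame);
        }
    }

    private void draw(BufferedImage frame, List<?> predictions) {
        // rendering of bounding boxes and IDs is omitted in this sketch
    }
}
```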

Notice that the distance calculations to figure out whether this is the same car or not are done in the video (main) thread, since this is a fast process.

In this way we don’t stop frames from showing into the display and have rather a non freezing normal video shown and in same time have bounding boxes together with IDs for tracking. Of course since this process is asynchronous the prediction is not really real time. While YOLO thread is working on predicting bounding boxes for one of the frames the Video thread is already showing newer frames and put into the queue as well. But as we have seen practically this works fairly well especially if the cars are not moving fast or changing position suddenly fast.

Application

It is possible to run the application from source by simply executing the RunCarTracking class. After running the application a Java GUI will be shown, as below:

We can choose the model; the best suggested model to choose is 611_epoch_data_e512_b256_980.zip.

Additionally, it is possible to change the threshold as well, so feel free to play with other values and notice how the tracking accuracy changes.

The last option we can choose is the strategy used for tracking:

  • Use IoU and Encodings: this option is the best one, as it uses the intersection over union of bounding boxes together with embeddings distances

  • Use Only IoU: this option uses only intersection over union

  • Use Only Encoding: this option uses only embeddings

Improvements and Future Considerations

Although, considering the small training data set and a modest neural network such as CIFAR-10, the results and accuracy were OK, for real-world usage we may need more context when training and predicting.

What we mean by context is that we as humans, when classifying the next picture as the same car, also take into consideration the previous picture of the car, or maybe a few previous pictures. The neural networks trained so far don't have a memory; as a consequence, they don't take previous pictures into consideration when predicting the current picture's embeddings. Long short-term memory neural networks, or LSTMs, can be of great help here; they are widely used in natural language processing, where context is much more crucial.

It is worth mentioning that the last convolution layer of VGG-16 provided quite good results, and mixing that with intersection over union actually offers results quite similar to the CIFAR-10 model trained only on cars. So it may be worth training VGG-16 only on cars and seeing if the embeddings generalize well.

During this blog we did not tune the lambda parameter of center loss but rather fixed it to 0.0001, so it may be worth trying other values, as the original paper showed that different values actually have a great impact on generalization.

Although intuitively it looks like more data would help, we have seen that adding just any data didn't help (like the rotating images). So it may be worth training with more data, but only from videos or other sources which are close to what the neural network does during tracking.

If you found this useful, feel free to share.