Object tracking with dlib

This tutorial will teach you how to perform object tracking using dlib and Python. After reading today’s blog post you will be able to track objects in real-time video with dlib.

A couple months ago we discussed centroid tracking, a simple, yet effective method to (1) assign unique IDs to each object in an image and then (2) track each of the objects and associated IDs as they move around in a video stream.

The biggest downside to this object tracking algorithm is that a separate object detector has to be run on each and every input frame — in most situations, this behavior is undesirable as object detectors, including HOG + Linear SVM, Faster R-CNNs, and SSDs can be computationally expensive to run.

An alternative approach would be to:

  1. Perform object detection once (or once every N frames)

  2. And then apply a dedicated tracking algorithm that can keep tracking of the object as it moves in subsequent frames without having to perform object detection

Is such a method possible?

The answer is yes, and in particular, we can use dlib’s implementation of the correlation tracking algorithm.

In the remainder of today’s blog post, you will learn how to apply dlib’s correlation tracker to track an object in real-time in a video stream.

To learn more about dlib’s correlation tracker, just keep reading.

Object tracking with dlib

We’ll start off today’s tutorial with a brief discussion of dlib’s implementation of correlation-based object tracking.

From there I will show you how to utilize dlib’s object tracker in your own applications.

Finally, we’ll wrap up today by discussing some of the limitations and drawbacks of dlib’s object tracker.

What are correlation trackers?

The dlib correlation tracker implementation is based on Danelljan et al.’s 2014 paper, Accurate Scale Estimation for Robust Visual Tracking.

Their work, in turn, builds on the popular MOSSE tracker from Bolme et al.’s 2010 work, Visual Object Tracking using Adaptive Correlation Filters. While the MOSSE tracker works well for objects that are translated, it often fails for objects that change in scale.

The work of Danelljan et al. proposed utilizing a scale pyramid to accurately estimate the scale of an object after the optimal translation was found. This breakthrough allows us to track objects that change in both (1) translation and (2) scaling throughout a video stream — and furthermore, we can perform this tracking in real-time.

For a detailed review of the algorithm, please refer to the papers linked above.

Project structure

To see how this project is organized, simply use the tree command in your terminal:

  $ tree.├── input│ ├── cat.mp4│ └── race.mp4├── output│ ├── cat_output.avi│ └── race_output.avi├── mobilenet_ssd│ ├── MobileNetSSD_deploy.caffemodel│ └── MobileNetSSD_deploy.prototxt└── track_object.py 3 directories, 7 files

.

│ ├── cat.mp4

├── output

│ └── race_output.avi

│ ├── MobileNetSSD_deploy.caffemodel

└── track_object.py

3 directories, 7 files

We have three directories:

input/ : Contains input videos for object tracking.

output/ : Our processed videos. In the processed video, the tracked object is annotated with a box and label.

mobilenet_ssd/ : The Caffe CNN model files are contained within this directory.

Today we’ll be reviewing one Python script: track_object.py .

Implementing our dlib object tracker

Let’s go ahead and get started implementing our object tracker using dlib.

Open up track_object.py and insert the following code:

  # import the necessary packagesfrom imutils.video import FPSimport numpy as npimport argparseimport imutilsimport dlibimport cv2

from imutils.video import FPS

import argparse

import dlib

Here we import our required packages. Notably, we’re using dlib, imutils, and OpenCV.

From there, let’s parse our command line arguments:

91011121314151617181920212223 # construct the argument parse and parse the argumentsap = argparse.ArgumentParser()ap.add_argument(“-p”, “–prototxt”, required=True, help=”path to Caffe ‘deploy’ prototxt file”)ap.add_argument(“-m”, “–model”, required=True, help=”path to Caffe pre-trained model”)ap.add_argument(“-v”, “–video”, required=True, help=”path to input video file”)ap.add_argument(“-l”, “–label”, required=True, help=”class label we are interested in detecting + tracking”)ap.add_argument(“-o”, “–output”, type=str, help=”path to optional output video file”)ap.add_argument(“-c”, “–confidence”, type=float, default=0.2, help=”minimum probability to filter weak detections”)args = vars(ap.parse_args())

10

12

14

16

18

20

22

ap = argparse.ArgumentParser()

help=”path to Caffe ‘deploy’ prototxt file”)

help=”path to Caffe pre-trained model”)

help=”path to input video file”)

help=”class label we are interested in detecting + tracking”)

help=”path to optional output video file”)

help=”minimum probability to filter weak detections”)

Our script has four required command line arguments:

–prototxt : Our path to the Caffe deploy prototxt file.

–model : The path to the Caffe pre-trained model.

–video : The path to the input video file. Today’s script works with video files rather than your webcam (but you could easily change it to support a webcam stream).

–label : A class label that we are interested in detecting and tracking. Review the next code block for the available classes that this model supports.

And two optional ones:

–output : An optional path to an output video file if you’d like to save the results of the object tracker.

–confidence : With a default=0.2 , this is the minimum probability threshold and it allows us to filter weak detections from our Caffe object detector.

Let’s define the classes that this model supports and load our network from disk:

  # initialize the list of class labels MobileNet SSD was trained to# detectCLASSES = [“background”, “aeroplane”, “bicycle”, “bird”, “boat”, “bottle”, “bus”, “car”, “cat”, “chair”, “cow”, “diningtable”, “dog”, “horse”, “motorbike”, “person”, “pottedplant”, “sheep”, “sofa”, “train”, “tvmonitor”] # load our serialized model from diskprint(“[INFO] loading model…“)net = cv2.dnn.readNetFromCaffe(args[“prototxt”], args[“model”])

detect

“bottle”, “bus”, “car”, “cat”, “chair”, “cow”, “diningtable”,

“sofa”, “train”, “tvmonitor”]

load our serialized model from disk

net = cv2.dnn.readNetFromCaffe(args[“prototxt”], args[“model”])

We’ll be using a pre-trained MobileNet SSD to perform object detection in a single frame. From there the object location will be handed off to dlib’s correlation tracker for tracking throughout the remaining frames of the video.

The model included with the “Downloads”supports 20 object classes (plus 1 for the background class) on Lines 27-30.

Note:If you’re using a different Caffe model, you’ll need to redefine this CLASSES list. Similarly, don’t modify this list if you’re using the model included with today’s download. If you’re confused about how deep learning object detectors work, be sure to refer to this getting started guide.

Prior to looping over frames, we need to load our model into memory. This is handled on Line 34 where all that is required to load a Caffe model is the path to the prototxt and model files (both available in our command line args dictionary).

Now let’s perform important initializations, notably our video stream:

  # initialize the video stream, dlib correlation tracker, output video# writer, and predicted class labelprint(“[INFO] starting video stream…“)vs = cv2.VideoCapture(args[“video”])tracker = Nonewriter = Nonelabel = ““ # start the frames per second throughput estimatorfps = FPS().start()

writer, and predicted class label

vs = cv2.VideoCapture(args[“video”])

writer = None

 

fps = FPS().start()

Our video stream, tracker , and video writer objects are initialized on Lines 39-41. We also initialize our textual label on Line 42.

Our frames-per-second estimator is instantiated on Line 45.

Now we’re ready to begin looping over our video frames:

4748495051525354555657585960616263646566 # loop over frames from the video file streamwhile True: # grab the next frame from the video file (grabbed, frame) = vs.read()  # check to see if we have reached the end of the video file if frame is None: break  # resize the frame for faster processing and then convert the # frame from BGR to RGB ordering (dlib needs RGB ordering) frame = imutils.resize(frame, width=600) rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # if we are supposed to be writing a video to disk, initialize # the writer if args[“output”] is not None and writer is None: fourcc = cv2.VideoWriter_fourcc(*“MJPG”) writer = cv2.VideoWriter(args[“output”], fourcc, 30, (frame.shape[1], frame.shape[0]), True)

48

50

52

54

56

58

60

62

64

66

while True:

(grabbed, frame) = vs.read()

# check to see if we have reached the end of the video file

break

# resize the frame for faster processing and then convert the

frame = imutils.resize(frame, width=600)

 

# the writer

fourcc = cv2.VideoWriter_fourcc(*“MJPG”)

(frame.shape[1], frame.shape[0]), True)

We begin our while loop on Line 48 and proceed to grab a frame on Line 50.

Our frame is resized and the color channels are swapped on Lines 58 and 59. Resizing allows for faster processing — you can experiment with the frame dimensions to achieve higher FPS. Converting to RGB color space is required by dlib (OpenCV stores images in BGR order by default).

Optionally, at runtime, an output video path can be passed via command line arguments. So, if necessary, we’ll initialize our video writer on Lines 63-66. For more information on writing video to disk with OpenCV, see this previous post.

Next, we’ll need to detect an object for tracking (if we haven’t already):

  # if our correlation object tracker is None we first need to # apply an object detector to seed the tracker with something # to actually track if tracker is None: # grab the frame dimensions and convert the frame to a blob (h, w) = frame.shape[:2] blob = cv2.dnn.blobFromImage(frame, 0.007843, (w, h), 127.5)  # pass the blob through the network and obtain the detections # and predictions net.setInput(blob) detections = net.forward()

# apply an object detector to seed the tracker with something

if tracker is None:

(h, w) = frame.shape[:2]

 

# and predictions

detections = net.forward()

If our tracker object is None (Line 71), we first need to detect objects in the input frame . To do so, we create a blob (Line 74) and pass it through the network (Lines 78 and 79).

Let’s handle the detections now:

81828384858687888990919293 # ensure at least one detection is made if len(detections) > 0: # find the index of the detection with the largest # probability – out of convenience we are only going # to track the first object we find with the largest # probability; future examples will demonstrate how to # detect and extract specific objects i = np.argmax(detections[0, 0, :, 2])  # grab the probability associated with the object along # with its class label conf = detections[0, 0, i, 2] label = CLASSES[int(detections[0, 0, i, 1])]

82

84

86

88

90

92

if len(detections) > 0:

# probability – out of convenience we are only going

# probability; future examples will demonstrate how to

i = np.argmax(detections[0, 0, :, 2])

# grab the probability associated with the object along

conf = detections[0, 0, i, 2]

If our object detector finds any objects (Line 82), we’ll grab the one with the largest probability (Line 88).

We’re only demonstrating how to use dlib to perform single object tracking in this post, so we need to find the detected object with the highest probability. Next week’s blog post will cover multi-object tracking with dlib.

From there, we’ll grab the confidence ( conf ) and label associated with the object (Lines 92 and 93).

Now it’s time to filter out the detections. Here we’re trying to ensure we have the right type of object which was passed by command line argument:

9596979899100101102103104105106107108109110111112113114 # filter out weak detections by requiring a minimum # confidence if conf > args[“confidence”] and label == args[“label”]: # compute the (x, y)-coordinates of the bounding box # for the object box = detections[0, 0, i, 3:7] * np.array([w, h, w, h]) (startX, startY, endX, endY) = box.astype(“int”)  # construct a dlib rectangle object from the bounding # box coordinates and then start the dlib correlation # tracker tracker = dlib.correlation_tracker() rect = dlib.rectangle(startX, startY, endX, endY) tracker.start_track(rgb, rect)  # draw the bounding box and text for the object cv2.rectangle(frame, (startX, startY), (endX, endY), (0, 255, 0), 2) cv2.putText(frame, label, (startX, startY - 15), cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 255, 0), 2)

96

98

100

102

104

106

108

110

112

114

# confidence

# compute the (x, y)-coordinates of the bounding box

box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])

 

# box coordinates and then start the dlib correlation

tracker = dlib.correlation_tracker()

tracker.start_track(rgb, rect)

# draw the bounding box and text for the object

(0, 255, 0), 2)

cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 255, 0), 2)

On Line 97 we check to ensure that conf exceeds the confidence threshold and that the object is actually the class type we’re looking for. When we run the script later, we’ll use “person” or “cat” as examples so you can see how we can filter results.

We determine bounding box coordinates of our object on Lines 100 and 101.

Then we establish our dlib object tracker and provide the bounding box coordinates (Lines 106-108). Future tracking updates will be easy from here on.

A bounding box rectangle and object class label text is drawn on the frame on Lines 111-114.

Let’s handle the case where we’ve already established a tracker :

116117118119120121122123124125126127128129130131132133134 # otherwise, we’ve already performed detection so let’s track # the object else: # update the tracker and grab the position of the tracked # object tracker.update(rgb) pos = tracker.get_position()  # unpack the position object startX = int(pos.left()) startY = int(pos.top()) endX = int(pos.right()) endY = int(pos.bottom())  # draw the bounding box from the correlation object tracker cv2.rectangle(frame, (startX, startY), (endX, endY), (0, 255, 0), 2) cv2.putText(frame, label, (startX, startY - 15), cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 255, 0), 2)

117

119

121

123

125

127

129

131

133

# the object

# update the tracker and grab the position of the tracked

tracker.update(rgb)

 

startX = int(pos.left())

endX = int(pos.right())

 

cv2.rectangle(frame, (startX, startY), (endX, endY),

cv2.putText(frame, label, (startX, startY - 15),

This else block handles the case where we’ve already locked on to an object for tracking.

Think of it like a dogfight in the movie, Top Gun. Once the enemy aircraft has been locked on by the “guidance system”, it can be tracked via updates.

This requires two main actions on our part:

Update our tracker object (Line 121) — the heavy lifting is performed in the backend of this update method. Grab the position ( get_position ) of our object from the tracker (Line 122). This would be where a PID control loop would come in handy if, for example, a robot seeks to follow a tracked object. In our case, we’re just going to annotate the object in the frame with a bounding box and label on Lines 131-134.

Let’s finish out the loop:

136137138139140141142143144145146147148149 # check to see if we should write the frame to disk if writer is not None: writer.write(frame)  # show the output frame cv2.imshow(“Frame”, frame) key = cv2.waitKey(1) & 0xFF  # if the q key was pressed, break from the loop if key == ord(“q”): break  # update the FPS counter fps.update()

137

139

141

143

145

147

149

if writer is not None:

 

cv2.imshow(“Frame”, frame)

 

if key == ord(“q”):

 

fps.update()

If the frame should be written to video, we do so on Lines 137 and 138.

We’ll show the frame on the screen (Line 141).

If the quit key (“q”) is pressed at any point during playback + tracking, we’ll break out of the loop (Lines 142-146).

Our fps estimator is updated on Line 149.

Finally, let’s perform print out FPS throughput statistics and release pointers prior to the script exiting:

151152153154155156157158159160161162 # stop the timer and display FPS informationfps.stop()print(“[INFO] elapsed time: {:.2f}”.format(fps.elapsed()))print(“[INFO] approx. FPS: {:.2f}”.format(fps.fps())) # check to see if we need to release the video writer pointerif writer is not None: writer.release() # do a bit of cleanupcv2.destroyAllWindows()vs.release()

152

154

156

158

160

162

fps.stop()

print(“[INFO] approx. FPS: {:.2f}”.format(fps.fps()))

check to see if we need to release the video writer pointer

writer.release()

do a bit of cleanup

vs.release()

Housekeeping for our script includes:

Our fps counter is stopped and the FPS information is displayed in the terminal (Lines 152-154. Then, if we were writing to an output video, we release the video writer (Lines 157 and 158).

  • Lastly, we close all OpenCV windows and release the video stream (Lines 161 and 162).

Running dlib’s object tracker in real-time

To see our dlib object tracker in action, make sure you use the “Downloads� section of this blog post to download the source code.

From there, open up a terminal and execute the following command:

  $ python track_object.py –prototxt mobilenet_ssd/MobileNetSSD_deploy.prototxt \ –model mobilenet_ssd/MobileNetSSD_deploy.caffemodel –video input/race.mp4 \ –label person –output output/race_output.avi[INFO] loading model…[INFO] starting video stream…[INFO] elapsed time: 13.18[INFO] approx. FPS: 25.80

–model mobilenet_ssd/MobileNetSSD_deploy.caffemodel –video input/race.mp4 \

[INFO] loading model…

[INFO] elapsed time: 13.18

Usain Bolt (Olympic World Record holder) was detected initially with highest confidence at the beginning of the video. From there, he is tracked successfully throughout his 100m race.

The full video can be found below:

Below we have a second example of object tracking with dlib:

  $ python track_object.py –prototxt mobilenet_ssd/MobileNetSSD_deploy.prototxt \ –model mobilenet_ssd/MobileNetSSD_deploy.caffemodel –video input/cat.mp4 \ –label cat –output output/cat_output.avi[INFO] loading model…[INFO] starting video stream…[INFO] elapsed time: 6.76[INFO] approx. FPS: 24.12

–model mobilenet_ssd/MobileNetSSD_deploy.caffemodel –video input/cat.mp4 \

[INFO] loading model…

[INFO] elapsed time: 6.76

The cat above was part of a BuzzFeed segment cat owners trying to take their cats for a walk (as if they were dogs). Poor cats!

Drawbacks and potential improvements

If you watched the full output video of the demo above, you would have noticed the object tracker behaving strangely towards the end of the demo, as this GIF demonstrates.

So, what’s going on here?

Why is the tracker losing the object?

Keep in mind there is no such thing as a “perfectâ€� object tracker — and furthermore, this object tracking algorithm is not requiring you to run a more expensive object detector on each and every frame of the input image.

Instead, dlib’s correlation tracker is combining both (1) prior information regarding the location of the object bounding box in the previous frame along with (2) data garnered from the current frame to infer where the new location of the object is.

There will certainly be times when the algorithm loses the object.

To remedy this situation, I recommend occasionally running your more expensive object detector to (1) validate the object is still there and (2) reseed the object tracking with the updated (and ideally correct) bounding box coordinates. August’s blog post on people counting with OpenCV accomplished exactly this, so be sure to check it out.

What about multi-object tracking?

Undoubtedly, I know there will be PyImageSearch readers wishing to apply this method to multi-object tracking rather than single object tracking.

Is it possible to track multiple objects using dlib’s correlation tracker?

The answer is yes, absolutely!

I’ll be covering multi-object tracking next week, so stay tuned.

Video credits

To create the examples for this tutorial I needed to use clips from two different videos. A big thank you and credit to BuzzFeed Video and GERrevolt.

Summary

In today’s blog post we discussed dlib’s object tracking algorithm.

Unlike July’s tutorial on centroid tracking, dlib’s object tracking algorithm can update itself utilizing information garnered from the input RGB image — the algorithm does not require that a set of bounding boxes be computed for each and every frame in the input video stream.

As we found out, dlib’s correlation tracking algorithm is quite robust and capable of running in real-time.

However, the biggest drawback is that the correlation tracker can become “confused� and lose the object we wish to track if viewpoint changes substantially or if the object to be tracked becomes occluded.

In those scenarios we can re-run our (computationally expensive) object detector to re-determine the location of our tracked object — be sure to refer to this blog post on people counting for such an implementation.

In our next blog post we’ll be discussing multi-object tracking with dlib — to be notified when the next blog post goes live (and download the source code to today’s post), just enter your email address in the form below.

Downloads: