Tracking Vehicles with Deep Learning

Demystifying the Clearview AI Blog Series (Part 5)

Samuel Brice
7 min read · Nov 29, 2020

Face Identification and Re-Identification

When a Deep Learning model processes an image, it first converts the image into a numerical format colloquially known as a vector embedding.

Mathematically speaking, a vector embedding is a point in a high-dimensional space that can be sorted, compared, and manipulated much like an ordinary number. Using vector embeddings, a Deep Learning model can calculate how similar two face images are. Based on a given distance criterion, a model can determine that two very different face images are in fact of the same person. Identifying a specific person this way, also known as instance re-identification, typically comes at the tail end of a much longer pipeline that starts with object detection.
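The idea can be sketched in a few lines: embed each face as a vector, then declare a match when the Euclidean distance between two embeddings falls below a threshold. The vectors and the 0.6 threshold below are toy values for illustration, not output from any real model.

```python
import numpy as np

# Toy 4-dimensional embeddings; real face models use hundreds of dimensions.
face_a = np.array([0.12, -0.40, 0.33, 0.08])
face_b = np.array([0.10, -0.38, 0.35, 0.05])  # same person, different photo
face_c = np.array([0.90, 0.22, -0.51, 0.47])  # a different person

def same_person(e1, e2, threshold=0.6):
    """Declare a match when the Euclidean distance is below the threshold."""
    return np.linalg.norm(e1 - e2) < threshold

print(same_person(face_a, face_b))  # True
print(same_person(face_a, face_c))  # False
```

In practice the threshold is tuned on a validation set to trade off false matches against missed matches.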

FaceNet most famously introduced vector embedding for face recognition in 2015. Leveraging a “triplet loss” function, FaceNet achieved a record accuracy of 99.63% when identifying and re-identifying faces from the Labeled Faces in the Wild (LFW) dataset.
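The triplet loss trains the embedding so that an anchor image sits closer to a "positive" (same identity) than to a "negative" (different identity) by at least a margin. A minimal numpy sketch of the loss itself, with toy 2-D embeddings standing in for real model outputs:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(||a - p||^2 - ||a - n||^2 + margin, 0).
    Zero once the positive is closer than the negative by the margin."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 1.0])
p = np.array([0.1, 0.9])   # another image of the same face
n = np.array([1.0, 0.0])   # a different face
print(triplet_loss(a, p, n))  # 0.0 -- the triplet is already satisfied
```

During training, the loss is summed over many such triplets and backpropagated through the embedding network.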

Thresholding the distance between two given embeddings (source)

While FaceNet isn’t publicly available, the open-source library OpenFace is based on the same method, making it straightforward to achieve similar state-of-the-art accuracy and performance. For a barebones Python facial recognition library, a personal recommendation is Adam Geitgey’s open-source Face Recognition.

Clearview web application “Face Search” (source)

Any one of the tools above is all it would take to deliver the type of “Face Search” functionality we’ve seen implemented in various demonstrations of Clearview’s web application.

For vehicles specifically, algorithms and libraries such as VehicleNet and RepNet leverage the same “triplet loss” and vector embedding techniques to enable very efficient vehicle identification (VehicleID) and re-identification (ReID).

Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles (source)

Large Scale Urban Vehicle Identification and Re-Identification

While single vehicle identification parallels the single face identification described above, large scale urban vehicle identification and re-identification (ReID) encompasses a much broader set of complexities. The map below hints at the difficulties involved by illustrating the coverage blind spots in the New York City Department of Transportation CCTV network.

In addition to incomplete coverage and visibility, large scale urban vehicle tracking and ReID pose complicated domain-specific Deep Learning modeling challenges. Unlike faces, which carry some underlying expectation of uniqueness, vehicles of the same make, model, and color are often visually identical. The most reliable way to uniquely identify a car would be its license plate, but plates are not always usable: privacy rules can put them off limits, and low-resolution footage often makes them illegible anyway.

Given all those inherent difficulties, multi-target multi-camera (MTMC) tracking is a well-known problem that is actively benchmarked by the AI City Challenge. State-of-the-art approaches to MTMC tracking combine several techniques, including convolutional neural networks, spatial-temporal constraints, generated tracklet synchronization, and weakly supervised data augmentation.

Metric learning, Multi-view inference, and GAN-based learning (source)

The four best-documented approaches to MTMC tracking are:

  1. attribute recognition and metric learning,
  2. detail aware attentive multi-view inference,
  3. GAN-based learning, and
  4. multi-modality data-based frameworks.

Architecture of a PROVID multi-modality framework (source)

With this series’s scope in mind and given the complexities detailed above, we’ll be targeting a small subset of Manhattan traffic, specifically along the downtown stretch of FDR Drive.

Focusing on traffic along a linear stretch of road enables us to implement the same type of identification and ReID technology as featured in Clearview’s desktop and mobile applications.

FDR Drive Vehicle Identification and Re-Identification

FDR Drive (officially referred to as the Franklin D. Roosevelt East River Drive) is a ten-mile limited-access parkway on Manhattan’s east side. The parkway is mostly three lanes in both directions, with all commercial vehicles (including trucks) banned from all sections, except for a small junction near the south side.

The downtown to midtown stretch of FDR Drive starting from Grand Street up to East 53rd Street is ideal for easy vehicle tracking thanks to fairly good camera coverage in both directions and very few exits.

FDR Drive exit list — Battery Park City to Sutton Place (source)

Below is a map of camera positions, including exit locations for both southbound and northbound FDR Drive traffic, starting from Grand Street up to E 53rd Street. The parkway exits are essential to keep in mind because they account for the chance that a vehicle legitimately does not reappear at the next camera. In some cases, the model may detect a car just as it exits the parkway.

What makes this stretch of FDR Drive ideal for vector embedding based identification and ReID, similar to what we’d implement for facial recognition, is the consistency in camera orientation, illumination, and resolution. Within such a consistent global camera context, even a simple identification and ReID algorithm can track a vehicle’s movement along the parkway, since fewer adjustments are needed for each camera’s local environment and context.
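One cheap way to exploit the known camera layout is a travel-time gate: only compare embeddings of detections whose time gap is physically plausible given the distance between cameras. The function below is an illustrative sketch, not part of any library mentioned in this series, and the speed bounds are assumptions about typical parkway traffic.

```python
def plausible_reappearance(t_seen, t_candidate, gap_miles,
                           min_mph=10, max_mph=70):
    """Return True when the elapsed time between two detections is
    consistent with driving the camera gap at a plausible speed."""
    elapsed_hours = (t_candidate - t_seen) / 3600.0
    if elapsed_hours <= 0:
        return False  # candidate seen before (or at) the first sighting
    mph = gap_miles / elapsed_hours
    return min_mph <= mph <= max_mph

# Candidate seen 60 s after the first camera, cameras 0.5 mi apart:
print(plausible_reappearance(0, 60, 0.5))  # True  (implies ~30 mph)
print(plausible_reappearance(0, 5, 0.5))   # False (implies ~360 mph)
```

Gating like this shrinks the candidate set before any embedding comparison, which matters at urban scale.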

Vehicle Re-Identification Using Track-to-Track Distance

For CCTView vehicle re-identification, we’ll use snippets from a library initially designed for a ReID technique based on latent representation extraction from a track of vehicles. The details of the algorithm are beyond the scope of this series. At a high level, a track is a set of images of the same vehicle recorded by the same or different cameras. Instead of comparing a car’s vector embedding from a single frame, we compare embeddings across every image in both tracks.
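To make the track-to-track idea concrete, here is a minimal sketch that scores two tracks by the smallest pairwise distance between their frame embeddings. This is a simple stand-in for the library’s latent-representation distance, not its actual implementation, and the random arrays merely simulate per-frame embeddings.

```python
import numpy as np

def track_to_track_distance(track_a, track_b):
    """Minimum pairwise Euclidean distance between two tracks,
    where each track is an (n_frames, dim) array of frame embeddings."""
    diffs = track_a[:, None, :] - track_b[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(-1)).min())

rng = np.random.default_rng(0)
same = rng.normal(0.0, 0.05, (5, 8))    # five frames of one vehicle
other = rng.normal(1.0, 0.05, (4, 8))   # four frames of a different vehicle

# A track should be far closer to a slightly perturbed copy of itself
# than to a different vehicle's track:
print(track_to_track_distance(same, same + 0.01) <
      track_to_track_distance(same, other))  # True
```

Using every frame makes the comparison robust to the occasional blurry or occluded detection that would fool a single-frame match.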

Latent representation extraction and I2T/T2T Ranking procedure (source)

For CCTView, the ReID model extracts vector embeddings for vehicles from the second-to-last layer of a CNN fine-tuned for vehicle recognition. The CNN is based on the DenseNet201 architecture, trained on a large scale image dataset for vehicle re-identification in urban traffic surveillance (the VeRi dataset). While the library supports track-to-track distance computation, for simplicity’s sake this demo only implements image-to-image comparisons based on Euclidean distance. Track-to-track comparison would provide the more accurate and useful large-scale urban vehicle ReID capability.

Vehicle ReID by JDAI Research (source)

Whereas our object detection model from ImageAI utilized Google’s TensorFlow, our feature extraction pipeline is based on PyTorch, a similar deep learning framework developed by Facebook.

Below is a code snippet of the PyTorch DenseNet CNN module used:

In the next section, we’ll talk about combining the data, information, and libraries detailed above in designing a Deep Learning based application to achieve our desired functionality.
