Tracking Vehicles with Deep Learning
Demystifying the Clearview AI Blog Series (Part 5)
Face Identification and Re-Identification
When a Deep Learning model processes an image, it first converts the image into a numerical format known as a vector embedding.
Mathematically speaking, a vector embedding is an array of numbers representing a point in a high-dimensional space, one that can be sorted, compared, and manipulated. Using vector embeddings, a Deep Learning model can calculate how similar two face images are. Based on a given distance criterion, a model can determine that two very different face images are in fact of the same person. This kind of specific-person face identification, also known as instance re-identification, typically comes at the tail end of a much longer pipeline that starts with object detection.
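The comparison step can be sketched with a few lines of NumPy: two embeddings of the same person sit close together, while embeddings of different people sit far apart. The 4-dimensional vectors and the 0.6 threshold below are illustrative placeholders; real face models emit embeddings with 128 or more dimensions.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; real face models emit 128+ dims.
emb_a = np.array([0.1, 0.9, 0.3, 0.5])
emb_b = np.array([0.12, 0.88, 0.29, 0.52])  # same person, different photo
emb_c = np.array([0.9, 0.1, 0.7, 0.2])      # a different person

def euclidean(u, v):
    """Euclidean distance between two embeddings."""
    return float(np.linalg.norm(u - v))

THRESHOLD = 0.6  # illustrative same-identity cut-off

print(euclidean(emb_a, emb_b) < THRESHOLD)  # True: small distance, same identity
print(euclidean(emb_a, emb_c) < THRESHOLD)  # False: large distance, different identity
```

The same distance-and-threshold pattern carries over directly to vehicle embeddings later in this article.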
FaceNet most famously introduced vector embedding for face recognition in 2015. Leveraging a “triplet loss” function, FaceNet achieved a record accuracy of 99.63% when identifying and re-identifying faces from the Labeled Faces in the Wild (LFW) dataset.
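The "triplet loss" at the heart of FaceNet is easy to state: for an anchor image, a positive (same identity), and a negative (different identity), the loss penalizes the model unless the positive is closer to the anchor than the negative by at least a margin. Below is a minimal NumPy sketch of that objective; the 2-dimensional embeddings and the 0.2 margin are made-up values for illustration only.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: push the same-identity (positive) embedding
    closer to the anchor than the different-identity (negative) embedding,
    by at least `margin`. Uses squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(d_pos - d_neg + margin, 0.0))

anchor   = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])   # same identity, slightly different image
negative = np.array([1.0, 0.0])   # different identity

# The positive is already much closer than the negative, so the loss is zero.
print(triplet_loss(anchor, positive, negative))  # 0.0
```

Training minimizes this quantity over many (anchor, positive, negative) triplets, which is what shapes the embedding space so that simple distance comparisons work at inference time.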
While FaceNet itself isn’t publicly available, the open-source library OpenFace is based on the same method, making it easy to achieve similar state-of-the-art accuracy and performance. For a barebones Python facial recognition library, a personal recommendation is the open-source Face Recognition package published by Adam Geitgey.
Any one of the tools above is all it would take to deliver the type of “Face Search” functionality we’ve seen implemented in various demonstrations of Clearview’s web application.
For vehicles specifically, similar algorithms and libraries such as VehicleNet and RepNet leverage the same “triplet loss” and vector embedding techniques to enable very efficient vehicle identification (VehicleID) and re-identification (ReID) functionality.
Large Scale Urban Vehicle Identification and Re-Identification
While single vehicle identification is a challenge parallel to single face identification as described above, the task of large scale urban vehicle identification and re-identification (ReID) encompasses a much broader set of complexities. You can get a hint of the difficulties involved from the map below, which illustrates the coverage blind spots in the New York City Department of Transportation CCTV network.
In addition to incomplete coverage and visibility, there are complicated domain-specific Deep Learning modeling challenges in large scale urban vehicle tracking and ReID scenarios. Unlike faces, which carry some underlying expectation of uniqueness, many vehicles are visually identical: the same make, model, and color appear thousands of times across a city. The most reliable means of uniquely identifying a car would be its license plate, but plate reading is not always an option, both for privacy reasons and because low-resolution footage often renders plates illegible.
With all those inherent difficulties, multi-target multi-camera (MTMC) tracking is a well-known problem that is actively benchmarked by the AI City Challenge. State-of-the-art approaches to MTMC tracking leverage a combination of different techniques, including convolutional neural networks, spatial-temporal constraints, generated tracklet synchronization, and weakly supervised data augmentation.
The four best-documented methods used in MTMC tracking include:
- attribute recognition and metric learning,
- detail aware attentive multi-view inference,
- GAN-based learning, and
- multi-modality data-based frameworks.
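One of the spatial-temporal constraints mentioned above can be sketched concretely: before comparing embeddings across cameras, a tracker can discard candidate matches whose timestamps are physically implausible given the distance between the cameras. The camera names, spacing, and speed bounds below are hypothetical values chosen for illustration, not measurements from any real deployment.

```python
# Hypothetical distance (miles) between adjacent camera pairs.
CAMERA_GAP_MILES = {("cam_1", "cam_2"): 0.8}

def plausible_match(cam_a, t_a, cam_b, t_b, min_mph=10.0, max_mph=80.0):
    """Return True if a vehicle seen at cam_a at time t_a (seconds) could
    plausibly reappear at cam_b at time t_b, given assumed speed bounds."""
    gap = CAMERA_GAP_MILES.get((cam_a, cam_b))
    if gap is None:
        return False          # cameras are not adjacent in this direction
    dt_hours = (t_b - t_a) / 3600.0
    if dt_hours <= 0:
        return False          # a vehicle cannot appear downstream earlier
    implied_speed = gap / dt_hours
    return min_mph <= implied_speed <= max_mph

print(plausible_match("cam_1", 0, "cam_2", 60))  # True: 0.8 mi in 60 s ≈ 48 mph
print(plausible_match("cam_1", 0, "cam_2", 5))   # False: would require ~576 mph
```

Pruning impossible pairs this way shrinks the embedding-comparison search space dramatically, which is a large part of why spatial-temporal constraints feature so prominently in MTMC systems.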
With this series’s scope in mind and given the complexities detailed above, we’ll be targeting a small subset of Manhattan traffic, specifically along the downtown stretch of FDR Drive.
Focusing on traffic along a linear stretch of road enables us to implement the same type of identification and ReID technology as featured in Clearview’s desktop and mobile applications.
FDR Drive Vehicle Identification and Re-Identification
FDR Drive (officially referred to as the Franklin D. Roosevelt East River Drive) is a ten-mile limited-access parkway on Manhattan’s east side. The parkway is mostly three lanes in both directions, with all commercial vehicles (including trucks) banned from all sections, except for a small junction near the south side.
The downtown to midtown stretch of FDR Drive starting from Grand Street up to East 53rd Street is ideal for easy vehicle tracking thanks to fairly good camera coverage in both directions and very few exits.
Below is a map of camera positions, including exit locations for both southbound and northbound FDR Drive traffic, starting from Grand Street up to E 53rd Street. The parkway exits are essential to keep in mind, as they increase the chances that a vehicle legitimately will not reappear from one camera to the next. In some cases, the model may detect a car just as it exits the parkway.
What makes the above stretch of FDR Drive ideal for vector-embedding-based identification and ReID, similar to what we’d implement for facial recognition, is the consistency in camera orientation, illumination, and resolution. Within such a consistent camera context, even a simple identification and ReID algorithm can track a vehicle’s movement along the parkway, because it requires fewer adjustments for local camera environmental and contextual factors.
Vehicle Re-Identification Using Track-to-Track Distance
For CCTView vehicle re-identification, we’ll use snippets from a library originally designed for a ReID technique based on extracting latent representations from a track of vehicles. The details of the algorithm are beyond the scope of this series. At a high level, a track is a set of images of the same vehicle recorded by one or more cameras. Instead of comparing a car’s vector embedding from a single frame, the technique compares embeddings across entire tracks.
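A heavily simplified sketch of the track-to-track idea, assuming nothing beyond NumPy: treat each track as a stack of per-frame embeddings and score two tracks by their closest pair of frames. The actual library uses a more sophisticated ranking of latent representations; this minimum-pairwise-distance reduction is a stand-in for illustration.

```python
import numpy as np

def track_distance(track_a, track_b):
    """Distance between two vehicle tracks, where each track is an
    (n_frames, n_dims) array of per-frame embeddings. As a simplification,
    score the tracks by their minimum pairwise Euclidean distance."""
    diffs = track_a[:, None, :] - track_b[None, :, :]   # all frame pairs
    dists = np.linalg.norm(diffs, axis=-1)              # (n_a, n_b) matrix
    return float(dists.min())

track_1 = np.array([[0.1, 0.9], [0.2, 0.8]])    # vehicle A over two frames
track_2 = np.array([[0.15, 0.85], [0.9, 0.1]])  # candidate track elsewhere

print(track_distance(track_1, track_2))  # small: likely the same vehicle
```

Comparing whole tracks rather than single frames averages out one-off problems like motion blur or partial occlusion in any individual frame.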
For CCTView, the ReID model is used to extract vector embeddings for vehicles using the second-to-last layer of a CNN fine-tuned for vehicle recognition. The CNN is based on the DenseNet201 architecture, trained on the large-scale image dataset for vehicle re-identification in urban traffic surveillance (VeRi dataset). While the library supports track-to-track distance computation, for simplicity’s sake this demo only implements image-to-image comparisons based on Euclidean distance. Track-to-track comparison would provide the most accurate and useful large-scale urban vehicle ReID capability.
Whereas our object detection model from ImageAI utilized Google’s TensorFlow, our feature extraction pipeline is based on PyTorch, a similar deep learning framework developed by Facebook.
Below is a code snippet of the PyTorch DenseNet CNN module used:
In the next section, we’ll talk about combining the data, information, and libraries detailed above in designing a Deep Learning based application to achieve our desired functionality.
References
The Comprehensive Cars (CompCars) dataset
Understanding Open-Source Facial Recognition Through OpenFace
A Large-Scale Car Dataset for Fine-Grained Categorization and Verification
Vehicle Re-Identification using Track-to-track ranking of deep latent representation of vehicles
Building a Facial Recognition Pipeline with Deep Learning in Tensorflow
ELECTRICITY: An Efficient Multi-camera Vehicle Tracking System for Intelligent City
Going Beyond Real Data: A Robust Visual Representation for Vehicle Re-identification
VOC-ReID: Vehicle Re-identification based on Vehicle-Orientation-Camera
How to Train Triplet Networks with 100K Identities?
DeepFace vs Facenet for face recognition
Three convolutional neural network models for facial expression recognition in the wild
Vehicle Re-identification: Past, Present and Future
Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles
Deep learning-based methods for person re-identification: A comprehensive review