Choosing the Right Model for Object Detection
Demystifying Clearview AI Blog Series (Part 3)
Table of Contents
Previous: Collecting Data for Deep Learning Development
Next: Running Object Detection and Structuring Data
Faster R-CNNs, YOLO, and Single Shot Detectors
Object recognition describes a collection of three related computer vision tasks:
- Image classification,
- Object localization, and
- Object detection.
Object detection effectively combines the other two tasks: it both localizes one or more objects in an image and classifies each of them.
When it comes to Deep Learning-based object detection, the three primary object detection model types are:
- Faster Region-based Convolutional Neural Networks (Faster R-CNNs),
- You Only Look Once (YOLO), and
- Single Shot Detectors (SSDs).
Each model type includes trade-offs in complexity, speed, accuracy, and efficiency.
Faster R-CNNs
Faster R-CNNs are the best-known family of object detection models. The term “convolution” refers to the mathematical operation that allows for detecting local patterns at varying levels of abstraction, such as lines, curves, polygons, eyes, faces, heads, and bodies. Faster R-CNN models are complex and challenging to understand. Despite being “faster” than the original R-CNN, the algorithm remains slow relative to competing model families, on the order of 7 frames per second (FPS).
YOLO
YOLO models tend to be the fastest available, capable of processing 40–90 FPS, or up to 155 FPS on specialized configurations. What YOLO gains in speed, however, it gives up in accuracy.
SSDs
SSDs strike a balance between Faster R-CNNs and YOLO. Built on a simpler algorithm, SSDs generate detections much faster than Faster R-CNNs, at 22–46 FPS. While competitive with YOLO’s lower-end speeds, SSDs tend to be more accurate than YOLO.
Choosing the Right Model
When choosing the right model for object recognition, there are, in addition to the quantitative metrics described above, various qualitative considerations based on the nature of the desired application and the available data.
For example, SSDs tend to have trouble detecting objects that are very close together or very small. Findings from a Google Research paper on speed/accuracy trade-offs for modern convolutional object detectors establish that input image resolution significantly impacts both accuracy and speed: halving image width and height lowers accuracy by roughly 16% on average while reducing inference time by about 27%.
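As a back-of-the-envelope sketch of that trade-off (the 16% and 27% figures are the paper’s reported averages; the function name and starting values are illustrative):

```python
def halved_resolution_estimate(accuracy, inference_time_s,
                               accuracy_drop=0.16, time_drop=0.27):
    """Estimate accuracy and inference time after halving input width
    and height, using the average relative changes reported in the
    Google speed/accuracy trade-off paper."""
    return (accuracy * (1 - accuracy_drop),
            inference_time_s * (1 - time_drop))

# e.g. a model at 0.50 mAP and 2.0 s/frame on full-resolution input
est_accuracy, est_time = halved_resolution_estimate(0.50, 2.0)
```

Averages like these hide large per-model variance, but they are a useful first filter when deciding whether downscaling frames is worth the accuracy cost.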
When fine-tuning a model, techniques such as Feature Pyramid Networks (FPNs), a component built into YOLOv3, can generally improve any model’s performance on small objects.
Picking the right model for the task at hand is a trade-off between speed, accuracy, and efficiency.
Using ImageAI, RetinaNet, and ResNet-50 with COCO Weights
In facial recognition, there’s a wide range of models and libraries available. In addition to detecting faces within an image, there are libraries that identify perceived emotions or verify users by face (à la Face ID). Cloud-based services for implementing facial recognition include Microsoft Azure’s Face API and Amazon’s Rekognition API.
For simplicity’s sake, and to dispel any semblance of rigorous analysis, as part of the CCTView demo I opted to use one of the default models provided by ImageAI, a high-level object detection library. ImageAI is a Python library developed by the DeepQuest AI team. The library dramatically simplifies implementing object detection, with support for RetinaNet, YOLOv3, and TinyYOLOv3 model types.
The ImageAI RetinaNet model type presents itself as the best choice for our application, given the size and resolution of images from the DOT’s webcam feed. Specifically, we’ll use a Residual Network architecture with 50 layers (ResNet-50) pre-trained on the Common Objects in Context (COCO) dataset.
There’s a lot to be said about the ResNet architecture’s effectiveness and ingenuity compared to alternative backbone architectures; the image below summarizes a roughly 2,229-word article.
As promised in ImageAI’s promotional material, implementing object detection proves incredibly simple:
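A sketch of that detection step using ImageAI’s RetinaNet detector. It assumes the pre-trained weights file `resnet50_coco_best_v2.0.1.h5` has been downloaded from the ImageAI releases page; the file paths, the `filter_vehicles` helper, and the class list are mine, not ImageAI’s:

```python
# Vehicle classes we care about from COCO's label set (illustrative subset).
VEHICLE_CLASSES = {"car", "truck", "bus", "motorcycle"}

def filter_vehicles(detections):
    """Keep only detections whose class label is a vehicle type."""
    return [d for d in detections if d["name"] in VEHICLE_CLASSES]

def detect_objects(image_path, model_path="resnet50_coco_best_v2.0.1.h5"):
    # Imported here so the helper above works without ImageAI installed.
    from imageai.Detection import ObjectDetection

    detector = ObjectDetection()
    detector.setModelTypeAsRetinaNet()
    detector.setModelPath(model_path)
    detector.loadModel()
    # Returns a list of dicts with "name", "percentage_probability",
    # and "box_points" (x1, y1, x2, y2) keys.
    return detector.detectObjectsFromImage(
        input_image=image_path,
        output_image_path="frame_detected.jpg",
        minimum_percentage_probability=30,
    )

if __name__ == "__main__":
    for d in filter_vehicles(detect_objects("camera_frame.jpg")):
        print(d["name"], d["percentage_probability"], d["box_points"])
```

The `minimum_percentage_probability` threshold trades recall for precision; 30 is a permissive starting point for distant, low-resolution webcam vehicles.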
Running detection on individual webcam frames with ImageAI took, on average, 2 seconds on a stock MacBook Pro with a 2.7 GHz Quad-Core Intel Core i7, 16 GB of RAM, and an Intel Iris Plus Graphics 655 with 1536 MB. This figure was measured with various background processes running concurrently.
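To reproduce a rough per-frame figure, wall-clock timing around the detection call is enough; a minimal standard-library sketch (the helper name is mine):

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds), using a
    monotonic clock suited to short-interval benchmarking."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage: detections, seconds = time_call(detect_objects, "camera_frame.jpg")
```

A single-call measurement like this includes model warm-up and background load, which is consistent with how the 2-second average above was observed.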
Below is the result of running object detection on a single camera frame.
As shown above, ImageAI can successfully detect vehicles at various distances from the camera. While the version of ImageAI used for this CCTView demonstration doesn’t specifically support face detection, there’s a broad range of comparable libraries, such as Dlib, OpenCV, and PIL, with very similar interfaces and the same type of functionality specifically for faces.
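Once boxes are in hand, extracting each detected vehicle from the frame is a simple crop; a sketch with Pillow, assuming ImageAI-style `box_points` of (x1, y1, x2, y2) (the helper name is mine):

```python
from PIL import Image

def crop_detections(frame_path, detections):
    """Crop each detection's bounding box out of the frame, returning
    one image per detected object for downstream recognition."""
    frame = Image.open(frame_path)
    return [frame.crop(tuple(d["box_points"])) for d in detections]
```

Each crop can then be fed to a recognition model independently, which is exactly the hand-off the next step relies on.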
Building on this vehicle detection capability, the next step will be extracting the detected vehicles, then using an image recognition model to distinguish between different cars. But first, let’s talk about scaling object detection in the cloud.
References
Common Objects in Context Dataset
Speed/accuracy trade-offs for modern convolutional object detectors
Open Source Computer Vision Library
A Gentle Introduction to Object Recognition With Deep Learning
What do we learn from single shot object detectors (SSD, YOLOv3), FPN & Focal loss (RetinaNet)?
Implementing YOLO using ResNet as Feature extractor
Train Object Detection AI with 6 lines of code
ResNet (34, 50, 101): Residual CNNs for Image Classification Tasks
YOLOv3: An Incremental Improvement
An Overview of ResNet and its Variants
Deep Learning Performance Cheat Sheet
Review: YOLOv3 — You Only Look Once (Object Detection)
Object detection with deep learning and OpenCV
Research on Pedestrian Detection Technology Based on MSR and Faster R-CNN