Choosing the Right Model for Object Detection

Samuel Brice
Nov 15, 2020 · 5 min read

--

Demystifying Clearview AI Blog Series (Part 3)

Table of Contents

Previous: Collecting Data for Deep Learning Development

Next: Running Object Detection and Structuring Data

Design diagrams for Faster R-CNN, YOLO, and SSD model types.

Faster R-CNNs, YOLO, and Single Shot Detectors

Object recognition describes a collection of three related computer vision tasks:

  1. Image classification,
  2. Object localization, and
  3. Object detection.

Object detection combines the first two tasks: it classifies one or more objects in an image and localizes each with a bounding box.

When it comes to Deep Learning-based object detection, the three primary object detection model types are:

  1. Faster Region-based Convolutional Neural Networks (Faster R-CNNs),
  2. You Only Look Once (YOLO), and
  3. Single Shot Detectors (SSDs).

Each model type comes with trade-offs in complexity, speed, accuracy, and efficiency.

Faster R-CNNs

Faster R-CNNs are the best-known family of object detection models. The term “convolution” refers to the mathematical operation that allows the network to detect local patterns at varying levels of abstraction: lines, curves, polygons, eyes, faces, heads, bodies, et cetera. Faster R-CNN models are complex and challenging to understand. And despite being “faster” than the standard R-CNN, the algorithm remains slow relative to competing model families, on the order of 7 frames per second (FPS).

Faster R-CNN Flow (source)
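Faster R-CNN isn’t the model this series ends up using, but for a concrete feel of the family, here’s a minimal sketch using torchvision’s pretrained Faster R-CNN; the library choice and the “frame.jpg” path are assumptions for illustration only.

```python
# Illustrative sketch only: torchvision's pretrained Faster R-CNN,
# not the model used in this series. "frame.jpg" is a placeholder.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode

image = to_tensor(Image.open("frame.jpg").convert("RGB"))
with torch.no_grad():
    prediction = model([image])[0]  # dict of boxes, labels, scores

for box, label, score in zip(prediction["boxes"],
                             prediction["labels"],
                             prediction["scores"]):
    if score > 0.5:  # keep reasonably confident detections
        print(int(label), float(score), box.tolist())
```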

YOLO

YOLO models tend to be the fastest available, capable of processing 40–90 FPS, or up to 155 FPS in specialized configurations. What YOLO gains in speed, however, it gives up in accuracy.

YOLO uses DarkNet for feature extraction, followed by convolutional layers for detection. (source)
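For a sense of what running YOLO looks like in practice, here’s a hedged sketch using OpenCV’s dnn module to load a Darknet-format YOLOv3 model; the yolov3.cfg and yolov3.weights file names follow the standard Darknet release and are assumptions rather than files used in this series.

```python
# Illustrative sketch only: loading Darknet-format YOLOv3 with OpenCV's dnn
# module. The .cfg/.weights names follow the standard Darknet release.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

frame = cv2.imread("frame.jpg")  # placeholder input frame
# YOLO expects a square, normalized input blob; 416x416 is the standard size.
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for output in outputs:
    for detection in output:
        scores = detection[5:]  # per-class confidences follow box + objectness
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:  # keep confident detections
            print(class_id, float(scores[class_id]))
```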

SSDs

SSDs strike a balance between Faster R-CNNs and YOLO. Built on a simpler algorithm, SSDs can generate detections much faster than R-CNNs, at a rate of 22–46 FPS. And while competitive with YOLO on the lower end speed-wise, SSDs tend to be more accurate.

Single-shot prediction for both classification and location. (source)
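Likewise for SSD: torchvision also ships a pretrained SSD whose inference interface mirrors the Faster R-CNN sketch above. Again, this is an illustration, not the model used in this series.

```python
# Illustrative sketch only: torchvision's pretrained SSD300 with a VGG-16
# backbone, sharing the same inference interface as the Faster R-CNN above.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

image = to_tensor(Image.open("frame.jpg").convert("RGB"))  # placeholder frame
with torch.no_grad():
    prediction = model([image])[0]

keep = prediction["scores"] > 0.5  # confidence threshold
print(prediction["boxes"][keep], prediction["labels"][keep])
```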

Choosing the Right Model

When choosing the right model for object recognition, in addition to the quantitative metrics described above, there are various qualitative considerations based on the nature of the intended application and the available data.

For example, SSDs tend to have trouble detecting objects that are too close or too small. And per the findings of a Google Research paper on speed/accuracy trade-offs for modern convolutional object detectors, input image resolution significantly impacts both accuracy and speed: reducing image width and height by half lowers accuracy by 16% on average while cutting inference time by 27%.
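To make that trade-off concrete, here’s a rough timing sketch that runs the same pretrained detector (reusing torchvision’s Faster R-CNN from the earlier sketch) on a full-resolution frame and on one downscaled by half in each dimension:

```python
# Rough sketch: timing one detector at full vs. half resolution to observe
# the speed side of the trade-off. "frame.jpg" is a placeholder path.
import time
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

full = Image.open("frame.jpg").convert("RGB")
half = full.resize((full.width // 2, full.height // 2))  # half each dimension

for name, image in [("full", full), ("half", half)]:
    start = time.perf_counter()
    with torch.no_grad():
        model([to_tensor(image)])
    print(name, image.size, f"{time.perf_counter() - start:.2f}s")
```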

When fine-tuning a model, techniques such as Feature Pyramid Networks (FPNs), a component built directly into YOLOv3, can generally improve any model’s performance on small objects.

Picking the right model for the task at hand is a trade-off between speed, accuracy, and efficiency.

Overall mAP and GPU Time based on Meta Architecture. (source)

Using ImageAI, RetinaNet, and ResNet-50 with COCO Weights

For facial recognition, there’s a wide range of model and library options available. Beyond detecting faces within an image, there are libraries that identify perceived emotions or verify users by face (à la FaceID). Cloud-based services for implementing facial recognition include Microsoft Azure’s Face API and Amazon’s Rekognition API.

For simplicity’s sake, and to dispel any semblance of rigorous analysis, as part of the CCTView demo I opted to use one of the default models provided by ImageAI, a high-level object detection library. ImageAI is a Python library developed by the DeepQuest AI team. It dramatically simplifies implementing object detection, with support for RetinaNet, YOLOv3, and TinyYOLOv3 model types.

The ImageAI RetinaNet model presents itself as the best choice for our application, given the size and resolution of images from the DOT’s webcam feeds. Specifically, we’ll use a 50-layer Residual Network (ResNet-50) pre-trained on the Common Objects in Context (COCO) dataset.

There’s a lot to be said about the ResNet architecture’s effectiveness and ingenuity compared to other backbone architectures…the image below is drawn from a roughly 2,229-word article on the subject.

An Overview of ResNet and its Variants (source)

As promised in the ImageAI commercials, implementing object detection proves incredibly simple.
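Below is a minimal sketch of that detection call, assuming the pretrained RetinaNet ResNet-50 COCO weights file that ImageAI distributes (resnet50_coco_best_v2.0.1.h5) and a placeholder frame path:

```python
# Minimal sketch of ImageAI's high-level detection API. The weights file
# name matches ImageAI's pretrained RetinaNet ResNet-50 COCO release;
# "frame.jpg" is a placeholder input.
from imageai.Detection import ObjectDetection

detector = ObjectDetection()
detector.setModelTypeAsRetinaNet()
detector.setModelPath("resnet50_coco_best_v2.0.1.h5")
detector.loadModel()

detections = detector.detectObjectsFromImage(
    input_image="frame.jpg",
    output_image_path="frame_detected.jpg",  # annotated copy written to disk
    minimum_percentage_probability=30,
)

for detection in detections:
    print(detection["name"],
          detection["percentage_probability"],
          detection["box_points"])
```

Passing extract_detected_objects=True to detectObjectsFromImage additionally saves a cropped image per detection, which comes in handy for the vehicle extraction step discussed below.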

Running detection on individual webcam frames with ImageAI took two seconds on average using a stock MacBook Pro with a 2.7 GHz Quad-Core Intel Core i7, 16 GB of RAM, and an Intel Iris Plus Graphics 655 with 1536 MB. That performance was measured with various background processes running concurrently.

Below is the result of running object detection on a single camera frame.

As shown above, ImageAI can successfully detect vehicles at various distances from the camera. While the version of ImageAI used for this CCTView demonstration doesn’t specifically support face detection, there’s a broad range of comparable libraries, such as Dlib, OpenCV, and PIL, with very similar interfaces and the same type of functionality specifically for faces.
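For comparison, here’s a minimal face detection sketch with OpenCV’s bundled Haar cascade, which follows the same detect-and-return-boxes pattern; “frame.jpg” is again a placeholder.

```python
# Illustrative sketch: face detection with OpenCV's bundled Haar cascade.
# "frame.jpg" is a placeholder; the cascade file ships with opencv-python.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("frame.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # cascades expect grayscale
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    print("face at", (x, y), "size", (w, h))
```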

Building on this vehicle detection capability, the next step will be extracting the detected vehicles, then using an image recognition model to distinguish between different cars. But first, let’s talk about scaling object detection in the cloud.
