Designing a Deep Learning App

Samuel Brice
Dec 2, 2020 · 6 min read


Demystifying Clearview AI Blog Series (Part 6)

The Deep Learning Development Lifecycle

The Application Development Lifecycle

We started our deep learning development lifecycle by streaming data from the NYC DOT public API and deploying a cloud-based pipeline to scale the process. With our raw data in hand, we performed some simple cleaning and restructuring, then evaluated the performance of various object detection models for identifying vehicles in our captured CCTV feeds. Given the scope of this demonstration, we'll skip the remaining steps of the deep learning development lifecycle and instead transition to a standard application development lifecycle.

Comparing the deep learning development lifecycle above with the application development lifecycle shown below, notice that the terminology differs but the process is broadly the same. What's fundamentally different is the type of expertise you'd bring to bear when reevaluating the design of a deep learning model versus rethinking the design of a web application based on real-world performance and feedback.

SDLC Models (source)

Depending on the level of integration across teams, both cycles can happen in parallel, in sync, or intermingled, with each step of the deep learning development lifecycle feeding into the corresponding stage of the application development lifecycle and vice versa.

To a large extent, we've already planned out and analyzed the application to be developed: vehicle tracking using public CCTVs. We'll be diving specifically into the design process. It will help us understand the limitations inherent in deep learning and some of the patterns generally employed. Those patterns will shed light on how Clearview may have had to structure its systems and provide valuable insights as we discuss how you can protect your privacy from facial recognition technology.

FDR Drive Tracking Information Architecture Overview

What makes our selected stretch of FDR Drive ideal for vehicle tracking is the consistency in camera orientation, illumination, and resolution across the different locations along the parkway. In such a setting, a simple detection-and-ReID algorithm tracks a vehicle's movement well because it requires fewer adjustments for environmental and camera-specific factors.

As mentioned earlier, parkway exits are essential to keep in mind because they increase the chance that a vehicle legitimately will not reappear from one camera to the next. The distance between cameras is also important: it gives a general idea of when a vehicle should be expected in frame, assuming it has not exited the parkway. A large-scale urban tracking system would ideally infer such information automatically from GPS and other data. For this demonstration, the information has been manually compiled and outlined in the network chart below.
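To make that concrete, below is a minimal sketch (in TypeScript) of how such manually compiled network information might be represented. The camera IDs, spacing, and exit counts are illustrative placeholders rather than the series' actual values; only the expected-transit calculation mirrors the reasoning above.

```typescript
// Hypothetical representation of the manually compiled camera network.
// Camera IDs, distances, and exit counts are illustrative placeholders.
interface CameraSegment {
  fromCamera: string;      // upstream camera ID
  toCamera: string;        // next camera ID heading southbound
  distanceMiles: number;   // spacing between the two cameras
  exitsBetween: number;    // exits where a vehicle may legitimately disappear
}

const SPEED_LIMIT_MPH = 40;

const fdrSouthbound: CameraSegment[] = [
  { fromCamera: "cam-a", toCamera: "cam-b", distanceMiles: 0.8, exitsBetween: 1 },
  { fromCamera: "cam-b", toCamera: "cam-c", distanceMiles: 0.7, exitsBetween: 0 },
  { fromCamera: "cam-c", toCamera: "cam-d", distanceMiles: 0.9, exitsBetween: 2 },
  { fromCamera: "cam-d", toCamera: "cam-e", distanceMiles: 0.6, exitsBetween: 1 },
];

// When should a vehicle seen at `fromCamera` be expected at `toCamera`?
function expectedTransitSeconds(segment: CameraSegment): number {
  return (segment.distanceMiles / SPEED_LIMIT_MPH) * 3600;
}

for (const segment of fdrSouthbound) {
  console.log(
    `${segment.fromCamera} -> ${segment.toCamera}: ` +
      `~${Math.round(expectedTransitSeconds(segment))}s at the posted limit`
  );
}
```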

The north-facing cameras labeled "1 Ave @ 42 St" and "WBB — 10 NIR-4 @ FDR Rd" have been excluded because their orientation doesn't adequately capture southbound traffic. Likewise, the camera "FDR Dr @ 10 St" has been excluded for the same reason regarding northbound traffic. Overall, that leaves five cameras for tracking vehicle traffic along a roughly three-mile stretch of road. With a posted speed limit of 40 miles per hour, we can expect a vehicle to enter and leave this zone (end to end) in about 4 minutes.

Tracker Implementation

Our detection model can identify multiple cars within a single frame. Combining that capability with our ReID model lets us track a specific vehicle across multiple cameras.

To start tracking a vehicle, the user first selects the target vehicle from an initial anchoring frame. Using the individual object detection image extracted by ImageAI, the application will generate a vector embedding for the chosen car, then use that vector embedding to ReID the vehicle in downstream frames and cameras.
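As a rough sketch of that anchoring step, assuming (as the demo does later in this post) that a precomputed embedding is already available for each detection; the types and field names here are illustrative, not the demo's actual schema:

```typescript
// Illustrative types; field names are assumptions, not the demo's actual schema.
interface Detection {
  id: string;               // uniquely identifiable detection ID
  cameraId: string;
  capturedAt: number;       // Unix epoch milliseconds of the source frame
  embedding: Float32Array;  // precomputed ReID embedding for the cropped vehicle
}

interface TrackingTarget {
  anchorDetection: Detection; // the detection the user clicked in the anchoring frame
  embedding: Float32Array;    // embedding used to re-identify the vehicle downstream
}

// Called when the user clicks a vehicle in the initial anchoring frame.
function selectTarget(detection: Detection): TrackingTarget {
  return { anchorDetection: detection, embedding: detection.embedding };
}
```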

For subsequent cameras along the vehicle's expected path, the application will calculate the embedding for every detected vehicle, then compare those embeddings with the initial embedding to re-identify the target vehicle.
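Sketching that comparison step, reusing the illustrative Detection and TrackingTarget types from the previous snippet; the metric shown is plain Euclidean distance, in line with the ranking approach described later in this post:

```typescript
// Euclidean distance between two embeddings of equal length.
function euclideanDistance(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Rank all detections in a downstream frame by distance to the target embedding.
// The closest detection is the best re-identification candidate.
function rankCandidates(
  target: TrackingTarget,
  frameDetections: Detection[]
): { detection: Detection; distance: number }[] {
  return frameDetections
    .map((detection) => ({
      detection,
      distance: euclideanDistance(target.embedding, detection.embedding),
    }))
    .sort((a, b) => a.distance - b.distance);
}
```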

If a vehicle cannot be re-identified within the expected time frame, it's possible the detector failed to capture the car adequately, or the car has exited the parkway. Both the threshold for how long to wait before concluding that a vehicle has left the parkway and the level of confidence required before declaring that a given car is the correct one can be adjusted on the fly.

By implementing this detection and re-identification pattern frame by frame across various cameras, we can effectively track a vehicle’s progress along the parkway. This elementary set of functionality is a building block towards more complex tracking applications.
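Putting the pieces together, the frame-by-frame pattern might look roughly like the sketch below, which also folds in the adjustable wait-window and distance-threshold settings mentioned above. The fetchDetections callback and the TrackerSettings fields are hypothetical stand-ins, not the demo's actual API; the snippet reuses the types and rankCandidates helper from the earlier sketches.

```typescript
// User-adjustable tracker settings (names are illustrative).
interface TrackerSettings {
  maxDistance: number;     // how close an embedding must be to count as a match
  maxWaitSeconds: number;  // how long to wait before assuming the vehicle exited
}

// Track the target through each downstream camera in order, returning the
// chain of sightings. Stops when no convincing match appears in time.
async function trackAlongParkway(
  target: TrackingTarget,
  downstreamCameras: string[],
  settings: TrackerSettings,
  fetchDetections: (cameraId: string, fromMs: number, toMs: number) => Promise<Detection[]>
): Promise<Detection[]> {
  const sightings: Detection[] = [target.anchorDetection];
  let lastSeen = target.anchorDetection;

  for (const cameraId of downstreamCameras) {
    const windowStart = lastSeen.capturedAt;
    const windowEnd = windowStart + settings.maxWaitSeconds * 1000;

    const candidates = await fetchDetections(cameraId, windowStart, windowEnd);
    const [best] = rankCandidates(target, candidates);

    if (!best || best.distance > settings.maxDistance) {
      // Either the detector missed the vehicle or it exited the parkway.
      break;
    }

    sightings.push(best.detection);
    lastSeen = best.detection;
  }

  return sightings;
}
```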

For a more comprehensive introduction to state-of-the-art city-scale multi-target multi-camera vehicle tracking and re-identification, check out CityFlow and the AI City Challenge.

Demo Application Architecture Overview

The complete design and development process for the CCTView demo app is out of scope for this series. However, I will share some sketches from the design process and also explain how decisions made in Part 4 regarding data structures greatly facilitated development.

User Interface Sketch

As illustrated in the “User Interface” action flow above, the application is centered on two things: (1) camera frames and (2) object detections. With all contextual information regarding an object already included in the object’s uniquely identifiable ID, exchanges between the client and the server are very terse. This brevity keeps the client-server architecture and APIs lean and straightforward, as can be seen below.

Client-Server Architecture Sketch
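To illustrate why a self-describing ID keeps those exchanges terse, here is a purely hypothetical example; the real ID layout comes from the Part 4 data structures and is not reproduced here, so the format below is invented for demonstration only.

```typescript
// Invented ID format for illustration only:
// "<cameraId>|<frame timestamp>|<detection index>".
interface ParsedDetectionId {
  cameraId: string;
  capturedAt: Date;
  detectionIndex: number;
}

function parseDetectionId(id: string): ParsedDetectionId {
  const [cameraId, timestamp, index] = id.split("|");
  return { cameraId, capturedAt: new Date(timestamp), detectionIndex: Number(index) };
}

// A request for "more like this vehicle" only needs to carry the ID itself;
// the camera, frame, and detection can all be recovered from the ID alone.
console.log(parseDetectionId("fdr-cam-3|2020-06-05T14:03:21Z|2"));
```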

An additional benefit of the information symmetry between client and server is the ability to efficiently distribute the identification, re-identification, and matching workload across both sides. The server is only responsible for sending the client frames along with distance metrics indicating how closely each car within the frame matches the car dynamically selected by the user. The client uses that information, along with real-time settings from the user, to determine an appropriate match. Because actual matches are determined on the client, the tracker’s settings can be adjusted freely by the user, and the browser reflects the results immediately.

Component Diagram Sketch
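A sketch of that client-side step: the server has already computed a distance metric for each detection in a frame, so the browser only needs to apply the user's current settings to pick a match, with no additional round trips. The payload field names here are assumptions for illustration.

```typescript
// What the server sends for each frame: detection IDs plus distance metrics.
interface FrameMatchPayload {
  frameUrl: string;
  detections: { id: string; distance: number }[];
}

// User-adjustable setting; changing it re-evaluates matches instantly in the
// browser, with no round trip to the server.
interface ClientMatchSettings {
  maxDistance: number;
}

function pickMatch(
  payload: FrameMatchPayload,
  settings: ClientMatchSettings
): { id: string; distance: number } | null {
  const best = payload.detections.reduce(
    (closest, d) => (d.distance < closest.distance ? d : closest),
    { id: "", distance: Infinity }
  );
  return best.distance <= settings.maxDistance ? best : null;
}
```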

With frame metadata already structured into minute-level summaries, it’s possible to assemble a comprehensive timeline of detection summaries for a camera with only a small number of requests to the server. You can see the use of that data structure and API within the “Detections Timeline” chart component below.

Detections Timeline Component
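The request pattern behind that chart might look roughly like the following sketch; the summary fields and the endpoint path are assumptions for illustration, not the demo's actual API.

```typescript
// Hypothetical shape of a minute-level summary; field names are assumptions.
interface MinuteSummary {
  cameraId: string;
  minute: string;          // e.g. "2020-06-05T14:03"
  frameCount: number;
  detectionCount: number;
}

// Build a detections timeline for a camera with a single summary request,
// assuming an endpoint like the one sketched here exists.
async function fetchTimeline(
  cameraId: string,
  fromMinute: string,
  toMinute: string
): Promise<{ minute: string; detections: number }[]> {
  const url =
    `/api/cameras/${encodeURIComponent(cameraId)}/summaries` +
    `?from=${fromMinute}&to=${toMinute}`;
  const response = await fetch(url);
  const summaries: MinuteSummary[] = await response.json();
  return summaries.map((s) => ({ minute: s.minute, detections: s.detectionCount }));
}
```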

Using preprocessed object detections and precomputed vector embeddings made the LEAN Stack a suitable choice to develop and deploy the application. The entire application’s deployment size is less than 20 MB (15 MB for the client, 4 MB for the server); however, backed by an asynchronous Node.js server, it can seamlessly and quickly process over 10 GB of data from an external webstore.

Deployment and redeployment on Heroku take minutes, and running the demo is free. With the precomputed vector embeddings stored in a language-agnostic data structure, it was possible to use native Node.js to calculate Minimal Euclidean Distance and effectively run the ranking component of the ReID model without needing to deploy a separate Python environment.

MED Component
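In the spirit of that component, here is a minimal, dependency-free sketch of Minimal Euclidean Distance ranking in plain TypeScript, assuming the precomputed embeddings are packed back-to-back as 32-bit floats; the storage layout is an assumption, not the demo's actual format.

```typescript
// Minimal Euclidean Distance over a flat, language-agnostic embedding buffer.
// Assumes embeddings are packed back-to-back as 32-bit floats, `dim` values each.
function minimalEuclideanDistance(
  query: Float32Array,
  packedEmbeddings: Float32Array,
  dim: number
): { index: number; distance: number } {
  const count = packedEmbeddings.length / dim;
  let bestIndex = -1;
  let bestDistance = Infinity;

  for (let i = 0; i < count; i++) {
    let sum = 0;
    for (let j = 0; j < dim; j++) {
      const d = query[j] - packedEmbeddings[i * dim + j];
      sum += d * d;
    }
    const distance = Math.sqrt(sum);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }

  return { index: bestIndex, distance: bestDistance };
}
```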
