How I used Computer Vision & Deep Neural Networks to Modernize ID Verification
Object (computer science)
In computer science, an object can be a variable, a data structure, a function, or a method, and as such, is a value in memory referenced by an identifier.
Wikipedia
One of my main theses at Control, the last company I founded (a payment analytics software platform), concerned the simultaneously rich and anonymous nature of transactional data. A transaction can carry a wealth of metadata about a user’s identity (name, address, birth date, etc.), yet there is a disconnect between the flat data points a user may self-disclose and the multidimensionality of the actual, legitimate identity of the user behind a particular transaction.

If this disconnect could be resolved, there would certainly be a use case for identity-augmented transactional data to reduce incidents of online fraud. The benefits of reducing payment reversals and shipping losses are obvious. The same logic could be extended to less obvious use cases, such as the exchanges on sites like Craigslist that precede meeting in person. A ubiquitous and easy identity verification system, embedded as a prerequisite to both online and offline transactions, could promote security and safety across countless scenarios.
The important idea was to be able to capture, digitize, process, and “objectify” identity in as close to real time as possible.
I set out to develop a simple onboarding experience that would embed identity verification into any checkout or registration process and could easily be extended to any use case. I wanted the technology to be both technically elegant in its design and simple to implement from a non-technical perspective. The important idea was to be able to capture, digitize, process, and “objectify” the user’s identity in as close to real time as possible.
The Knuckle-busting Early Days of Payment Processing
The closest analogy to what I was trying to achieve is the leap the credit card industry made when it converted from analog to digital. Some of us may remember the infancy of the payments industry, when credit cards were exclusively used at the point of sale in real-world locations and “Amazon” was still most famous as a river in South America. The store clerk would write the amount and details of the purchase on a piece of three-ply carbon paper, where an imprint of the customer’s name and card number was taken by a machine commonly known as a “knuckle-buster”.

Every day at the close of business, the store manager had the responsibility of calling in (with a rotary phone!) to a card brand’s (Visa, American Express, etc.) back-office support rep to verbally read out each 16-digit card number from the carbon copies.

This was true offline processing before the inception of the internet. Invariably someone on the other end was adjusting a ledger record or punching data into some mainframe computer equivalent. At that point (long after the initial point of sale) the merchant would be advised whether the transaction was authorized.

The best-case scenario was that the honor system prevailed and the customer had sufficient funds to cover the charge.
The worst outcome could be that a Frank Abagnale type character had already skipped town and was laughing “Catch Me If You Can!” from the jump seat of a Pan Am flight bound for elsewhere.
The Stripe Blueprint
The best blueprint for what I was trying to achieve was that of payments wunderkind Stripe.
Consider a payment processed online through Stripe: once an intent to pay is submitted on a merchant’s eCommerce site, the transaction is circulated through various interrelated authentication and verification systems (Radar fraud checks, the Address Verification Service (AVS), the interchange network, acquiring banks, card issuers, etc.). The responses come back in fractions of a second, each API call adding a new set of information to the transaction. In virtually real time, Stripe generates a series of data objects, such as the ‘charge object’ and the ‘customer object’, which describe the essential details of the transaction.
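To make that concrete, here is a simplified, abridged sketch of the kind of charge object Stripe returns once those checks have run (the field values here are invented for illustration, and real charge objects carry many more fields):

```python
# Simplified, abridged sketch of a Stripe charge object after the fraud, AVS and
# issuer checks have run. Values are invented for illustration only.
charge = {
    "id": "ch_1F8ZQx",                        # illustrative identifier
    "object": "charge",
    "amount": 2500,                           # smallest currency unit (cents)
    "currency": "usd",
    "customer": "cus_FdX9K2",
    "paid": True,
    "status": "succeeded",
    "outcome": {
        "network_status": "approved_by_network",
        "risk_level": "normal",               # summary of Radar's fraud checks
        "seller_message": "Payment complete.",
        "type": "authorized",
    },
}
```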

Much as Stripe had done when it so craftily upended the established incumbents, I wanted to think beyond the existing paradigms of my own space.
The existing identity verification space is dominated by the likes of Jumio and Onfido, and the vast majority of their solutions depend on integrating iOS and Android software development kits (SDKs). This, of course, requires that the merchant have an iOS and/or Android app in the first place. The dependency stems from the multitude of challenges that arise when trying to acquire the correct media sources and image quality to perform ID verification.
Common thinking may suggest the merchant should buck up and build a native iOS or Android mobile app that forces users to continue through the app to verify their ID; problem solved, right? Well, imagine how limited the explosion of eCommerce would have been if Jeff Bezos had first forced users to download the Amazon app before they were allowed to shop. Nowadays it seems preposterous to download an app from Google Play or the Apple App Store for each site we interact with when Chrome does the job better and faster. Not to mention all the search engine optimization power that exists for the merchant outside the walled gardens of the app stores.
My goal was to instantaneously extract data from an identity document to generate an “identity object” … persisting or expiring as required.
Modern payments freed merchants from the primitive knuckle-buster. I wanted to bring the same liberation to ID verification through the ubiquity of the internet. I wanted to invent a 100% browser-based solution and remove the requirement that merchants build a native app before they could onboard and verify their end users. My goal was to instantaneously extract data from an identity document to generate an “identity object” which could be appended to an existing charge object related to the transaction.

The identity object could persist along with a related, unique customer object. Or it could expire with each transaction, requiring the user to re-verify their identity each time they transact. With an eye on optimizing the speed and UX of ID verification, much as Stripe did with its revolutionary Stripe Checkout, I believe verifying an identity should take no longer than filling in the billing and payment details within the same multi-step checkout flow.
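To illustrate the idea, here is my own hypothetical sketch of what such an identity object might look like (all field names are assumptions, not production code), with an optional expiry controlling whether it persists alongside the customer object or must be re-established on the next transaction:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class IdentityObject:
    """Hypothetical identity object produced by a browser-based verification step."""
    id: str
    customer_id: str                       # link to the related customer object
    document_type: str                     # e.g. "drivers_license", "passport"
    full_name: str
    date_of_birth: str
    verified: bool
    created_at: datetime = field(default_factory=datetime.utcnow)
    expires_at: Optional[datetime] = None  # None => persists with the customer object

    def is_valid(self) -> bool:
        """Is this object still usable for a new transaction?"""
        return self.verified and (
            self.expires_at is None or datetime.utcnow() < self.expires_at)

# An expiring variant: the user must re-verify on their next transaction.
ephemeral = IdentityObject(
    id="idn_123", customer_id="cus_456", document_type="drivers_license",
    full_name="Jane Doe", date_of_birth="1990-01-01", verified=True,
    expires_at=datetime.utcnow() + timedelta(minutes=15))
```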
The Challenge — Image Acquisition in the Wild
The biggest challenge with my browser-based approach is that it removes the ability to control the quality of media capture. With a native iOS or Android app, the developer retains granular control over the user experience and, importantly, over the native camera settings. The UI can assist and guide the user to position their identity document within a predefined rectangular “guard rail”, and the image resolution can be previewed and optimized before the image capture occurs.

Acquiring media via the web browser limits the user interface’s ability to extract similarly exacting images, because it relies largely on the default behavior of the browser and the device’s hardware. The final image captured through this method may vary in quality, orientation, lighting, exposure, size, etc.

My research indicated that Optical Character Recognition (OCR) would invariably play a part in extracting and digitizing the identity card’s data. OCR does not work well when text is blurry or tilted. I will discuss the OCR sequence in later parts, but suffice it to say the first hurdle would be an automated way to get a clean crop of the identity document in near real time.
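As a simple illustration of why capture quality matters, a rough sharpness pre-check (my own sketch; the threshold of 100 is an assumed starting point that would need tuning) can reject obviously blurry captures before OCR is even attempted:

```python
import cv2

def is_too_blurry(image_path: str, threshold: float = 100.0) -> bool:
    """Rough sharpness check: low variance of the Laplacian means few sharp edges,
    which usually indicates a capture too blurry for reliable OCR.
    The threshold is an assumption and needs tuning per device and camera."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    focus_score = cv2.Laplacian(gray, cv2.CV_64F).var()
    return focus_score < threshold
```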

I explored several computer vision Python libraries, such as OpenCV, to solve this problem. I experimented with various methods to isolate the ID card in the image, using techniques such as Canny edge detection and contour detection.
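A minimal sketch of that classical approach (my own illustrative code, assuming OpenCV 4.x; the Canny thresholds and the “largest four-point contour” heuristic are assumptions) looks like this:

```python
import cv2
import numpy as np

def find_card_quad(image_path: str):
    """Naive ID-card isolation: Canny edges -> contours -> largest 4-point polygon.
    Returns the four corner points, or None if no card-like quadrilateral is found."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 75, 200)          # thresholds assumed; need tuning
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in sorted(contours, key=cv2.contourArea, reverse=True)[:5]:
        perimeter = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
        if len(approx) == 4:                     # assume the biggest quad is the card
            return approx.reshape(4, 2)
    return None
```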

In reality, many of these techniques work well on simple examples but fail once real-world conditions such as noisy backgrounds and changing lighting are introduced.

Iterating on different algorithms and manually testing various thresholds improved the results of finding the ‘Region of Interest’ in some images, but the settings needed to be constantly recalibrated depending on the image quality, exposure, background, and various other factors.
Enter Computer Vision + Deep Neural Networks
To “recognize” an identity card’s Region of Interest indeed implied some sort of sophisticated machine learning.
To automate the digitization of ID documents in the wild, where the number of combinations (particularly backgrounds) could be unlimited, a straightforward algorithmic approach wouldn’t work. I needed a system that would learn how to recognize the four corner points of an ID card, regardless of the conditions or situation.
To “recognize” an identity card’s Region of Interest indeed implied some sort of sophisticated machine learning. I started researching the various machine learning models that could be applied to this problem. Due to the private information inherent in an identity card, I also needed to come up with some creative approaches to acquiring a sufficient dataset to train my model.
Some of the models I explored are being used at the cutting edge of artificial intelligence. For example, Holistically-Nested Edge Detection (HED) is an algorithm that uses a deep neural network to automatically determine the edges and object boundaries within an image.
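As a sketch of how HED can be run in practice (this follows the approach of OpenCV’s dnn edge-detection sample and assumes the pretrained Caffe files deploy.prototxt and hed_pretrained_bsds.caffemodel have been downloaded; the file and image names are placeholders):

```python
import cv2
import numpy as np

class CropLayer(object):
    """HED's Caffe graph uses a Crop layer that OpenCV's dnn module does not ship,
    so we register a minimal implementation that center-crops to the target size."""
    def __init__(self, params, blobs):
        self.xstart = self.xend = self.ystart = self.yend = 0

    def getMemoryShapes(self, inputs):
        input_shape, target_shape = inputs[0], inputs[1]
        batch, channels = input_shape[0], input_shape[1]
        height, width = target_shape[2], target_shape[3]
        self.ystart = (input_shape[2] - height) // 2
        self.xstart = (input_shape[3] - width) // 2
        self.yend, self.xend = self.ystart + height, self.xstart + width
        return [[batch, channels, height, width]]

    def forward(self, inputs):
        return [inputs[0][:, :, self.ystart:self.yend, self.xstart:self.xend]]

cv2.dnn_registerLayer("Crop", CropLayer)
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "hed_pretrained_bsds.caffemodel")

image = cv2.imread("id_card.jpg")                 # placeholder input image
height, width = image.shape[:2]
blob = cv2.dnn.blobFromImage(image, scalefactor=1.0, size=(width, height),
                             mean=(104.00698793, 116.66876762, 122.67891434),
                             swapRB=False, crop=False)
net.setInput(blob)
edge_map = net.forward()[0, 0]                    # single-channel edge probability map
edge_map = (255 * cv2.resize(edge_map, (width, height))).astype(np.uint8)
cv2.imwrite("edges.png", edge_map)
```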

HED has since served as a foundation for deeper learning applications. Semantic segmentation, for instance, uses deep learning to classify and identify objects pixel by pixel, and it provides the training material for autonomous cars to learn and respond to their driving environment.
The U-Net architecture, which uses convolutional networks for biomedical image segmentation, has been used to create deep learning models for segmenting nerves in ultrasound images, lungs in CT scans, and even interference in radio telescopes.

I dug into the codebases of all of these different models and used my findings to inform an algorithm and build the code that solved my own particular problem of document extraction. My model was initially trained on ImageNet before being applied to my custom ID document dataset. To train the model yourself, check out my GitHub repo. To test out the model, check out this repo plus my Jupyter notebook.
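For readers who want a feel for the general pattern before opening the repos, here is a hypothetical transfer-learning sketch (not the actual repo code; the MobileNetV2 backbone, head layers, and hyperparameters are all assumptions): an ImageNet-pretrained backbone with a small head that regresses the document’s four corner points.

```python
import tensorflow as tf

# Hypothetical sketch, not the repo's actual architecture: fine-tune an
# ImageNet-pretrained backbone to regress the ID document's 4 corner points
# as 8 values (x, y per corner), normalized to [0, 1].
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
backbone.trainable = False               # freeze ImageNet features initially

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(8, activation="sigmoid"),  # (x, y) for each of 4 corners
])
model.compile(optimizer="adam", loss="mse")

# train_images: (N, 224, 224, 3) float array, train_corners: (N, 8) normalized coords
# model.fit(train_images, train_corners, epochs=20, batch_size=32)
```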