Object (computer science), Wikipedia
In computer science, an object can be a variable, a data structure, a function, or a method, and as such, is a value in memory referenced by an identifier.
I set out to create a mobile-responsive identity verification application that would run in the browser, rather than follow the common native iOS and Android SDK approach. I wanted the technology to be both technically elegant in its design and simple to implement from a non-technical perspective. The key idea was to capture, digitize, process, and “objectify” a user’s identity as close to real time as possible.
Stripe generates a series of data objects, such as the ‘charge object’ and the ‘customer object’, which describe important details of the transaction.
My goal was to instantaneously extract data from an identity document to generate an “identity object” which could be appended to an existing charge object related to the transaction.
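As a rough illustration of that goal, the sketch below attaches a set of extracted identity fields to a charge-like record. Everything in it is an assumption made for illustration: the `IdentityObject` fields, the `attach_identity` helper, and the plain dictionary standing in for a charge object are hypothetical and do not call the Stripe API.

```python
from dataclasses import dataclass, asdict


@dataclass
class IdentityObject:
    """Hypothetical container for fields extracted from an ID document."""
    document_type: str
    full_name: str
    document_number: str
    date_of_birth: str  # ISO 8601 date string


def attach_identity(charge: dict, identity: IdentityObject) -> dict:
    """Return a copy of the charge with the identity fields stored under 'metadata'."""
    updated = dict(charge)
    metadata = dict(charge.get("metadata", {}))
    # Prefix the keys so identity fields cannot clash with existing metadata.
    for key, value in asdict(identity).items():
        metadata[f"identity_{key}"] = value
    updated["metadata"] = metadata
    return updated


# A plain dict standing in for a charge object (illustrative values only).
charge = {"id": "ch_example", "amount": 2000, "currency": "usd", "metadata": {}}
identity = IdentityObject("drivers_license", "Jane Doe", "D1234567", "1990-01-31")
enriched = attach_identity(charge, identity)
```

Returning a copy rather than mutating the input keeps the original charge record untouched, which makes the step easy to retry if extraction has to run again.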
The Challenge — Image Acquisition in the Wild
The biggest challenge with my browser-based approach is that it removes the ability to control the quality of media capture. With a native iOS or Android app the developer retains granular control over the user’s camera settings.
In a native app, the UI can guide the user to position their identity document within a predefined rectangular “guard-rail”, and the image resolution can be previewed and optimized before the capture occurs.
Acquiring media through the web browser limits the user interface’s ability to capture similarly exact images, because it relies heavily on the default behavior of the browser and the device hardware. The final image captured this way may vary in quality, orientation, light exposure, size, etc.
Optimized for OCR
Optical Character Recognition (OCR) would inevitably play a part in extracting and digitizing the identity card data. OCR does not work well when text is blurry or tilted.
The first hurdle would be finding an automated way to get a clean crop of the identity document in near real time.
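To make the idea of a clean crop concrete: if the four corners of the card were already known, flattening a tilted card would reduce to a perspective transform. The sketch below solves for that transform with NumPy. The corner coordinates, the output size, and the helper names `homography_from_corners` and `apply_homography` are assumptions made for illustration, not part of the application described here.

```python
import numpy as np


def homography_from_corners(src, dst) -> np.ndarray:
    """Solve for the 3x3 matrix H that maps each src corner onto its dst corner."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the 8 unknowns.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    # The ninth entry of H is fixed to 1 to remove the scale ambiguity.
    return np.append(h, 1.0).reshape(3, 3)


def apply_homography(H: np.ndarray, point) -> tuple:
    """Map a single (x, y) point through H using homogeneous coordinates."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return x / w, y / w


# Hypothetical corner positions of a tilted card in a photo (clockwise from top-left).
photo_corners = [(120, 80), (520, 110), (500, 370), (100, 330)]
# Target rectangle; 856 x 540 approximates the aspect ratio of a standard ID card.
flat_corners = [(0, 0), (856, 0), (856, 540), (0, 540)]

H = homography_from_corners(photo_corners, flat_corners)
```

Mapping each photo corner through `H` lands on the matching corner of the flat rectangle. Resampling every pixel through the same matrix would give the upright crop that OCR needs.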
Basic Computer Vision
I explored a number of Computer Vision libraries to solve this problem. I experimented with various methods to isolate the ID card in the image, using techniques such as Canny edge detection and contour detection.
Theory vs Reality
In reality, many of these techniques work well with simple examples but fail when real-world conditions such as noisy backgrounds and changes in lighting are introduced. Iterating on different algorithms and manually testing various thresholds improved results for finding the ‘Region of Interest’ in some images, but the thresholds needed to be constantly recalibrated depending on the image quality, exposure, background, and various other factors.
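The recalibration problem can be reproduced on a toy image. The sketch below uses a plain gradient-magnitude threshold in NumPy, which is a much simpler stand-in for Canny edge and contour detection, not those algorithms themselves. The synthetic scene, the brightness values, and the thresholds are all assumptions chosen for illustration.

```python
import numpy as np


def edge_mask(gray: np.ndarray, threshold: float) -> np.ndarray:
    """Mark pixels whose gradient magnitude exceeds a hand-picked threshold."""
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy) > threshold


def bounding_box(mask: np.ndarray):
    """Return (top, left, bottom, right) of the True pixels, or None if there are none."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())


def synthetic_scene(card_brightness: float, noise_scale: float, seed: int = 0) -> np.ndarray:
    """A bright rectangle (the 'card') on a darker background with optional noise."""
    rng = np.random.default_rng(seed)
    scene = np.full((60, 80), 50.0)
    scene[20:40, 25:60] = card_brightness
    return scene + rng.normal(0.0, noise_scale, scene.shape)


# Well-lit card, clean background: a threshold of 40 finds the card outline.
box_clean = bounding_box(edge_mask(synthetic_scene(200.0, 0.0), threshold=40.0))
# Dimly lit card, same threshold: the edges are too weak and nothing is found.
box_dim = bounding_box(edge_mask(synthetic_scene(90.0, 0.0), threshold=40.0))
# Lowering the threshold recovers the dim card, but on a noisy background it
# also fires on the noise, so the box no longer isolates the card.
box_noisy = bounding_box(edge_mask(synthetic_scene(90.0, 10.0), threshold=15.0))
```

No single threshold works for all three scenes, which is the same dead end that motivates a learned approach.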
Applying Computer Vision + Deep Neural Networks
To automate the digitization of ID documents in the wild, where the number of combinations (particularly backgrounds) could be unlimited, a straightforward algorithmic approach wouldn’t work. I needed a system that would learn how to recognize the four corner points of an ID card, regardless of the conditions or situation.
To “recognize” an identity card’s Region of Interest indeed implied some sort of sophisticated machine learning. I started researching the various machine learning models that could be applied to this problem.
Some of the models I explored are being used at the cutting edge of artificial intelligence, for example Holistically-Nested Edge Detection, semantic segmentation, and U-Net.
Holistically-Nested Edge Detection (HED) is an algorithm that uses a deep neural network to automatically determine the edges and object boundaries in an image.
HED has served as a foundation for building further deep learning applications. Semantic segmentation uses deep learning to assign a class label to every pixel in an image, and is used to help autonomous cars perceive and respond to their driving environment.
The U-Net architecture, which uses convolutional networks for biomedical image segmentation, has been used to create deep learning models for segmenting nerves in ultrasound images, lungs in CT scans, and even interference in radio telescope data.
Learning about the U-Net model led me to an interesting implementation of a model trained on the MIDV-500 dataset.
Due to the private information inherent in an identity card, I needed to come up with some creative approaches to acquiring a dataset large enough to train my model to recognize what an identity document actually looks like, so that it could distinguish the document from other random objects in the background.
Thankfully there exists MIDV-500: A Dataset for Identity Documents Analysis and Recognition on Mobile Devices in Video Stream, a unique dataset that covers 50 document types, including 17 types of ID cards, 14 types of passports, 13 types of driver’s licenses and more. The team at Smart Engines found a government template for each ID document, printed out each document, laminated the documents to simulate reflections, and shot 10 videos of each of them in 5 different conditions using iPhone 5 and Samsung Galaxy S3 smartphones. These videos were then converted to 15,000 frames, covering a wide range of the natural environments in which one might expect to perform a KYC check when signing up for a banking or crypto trading app.
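The dataset figures quoted above can be cross-checked with a few lines of arithmetic. The only assumption is that every video contributes the same number of frames.

```python
# Figures as quoted in the description of MIDV-500 above.
document_types = 50
id_cards, passports, drivers_licenses = 17, 14, 13
videos_per_document = 10
total_frames = 15_000

# Document types not covered by the three named categories ("and more").
other_types = document_types - (id_cards + passports + drivers_licenses)  # 6

total_videos = document_types * videos_per_document  # 500
# Assumes frames are spread evenly across videos.
frames_per_video = total_frames // total_videos  # 30
```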
I dug into the codebases of all of these different models and used my findings to inform an algorithm and to build the code that solved my own particular problem of semantic segmentation, the first of many challenges I would confront. My model was initially trained on ImageNet before being applied to my custom ID document dataset. To train the model yourself, check out my GitHub repo. To test out the model, check out this repo plus my Jupyter notebook.
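A segmentation model outputs a per-pixel mask rather than corner points, so some post-processing step is needed to get from one to the other. The text does not describe that step. The sketch below shows one simple possibility in NumPy, under the assumption that the mask is a clean boolean array and the card is rotated by well under 45 degrees. The `corners_from_mask` helper and the synthetic mask are hypothetical.

```python
import numpy as np


def corners_from_mask(mask: np.ndarray):
    """Estimate the four corners of the masked region as (x, y) pairs.

    Uses the extreme values of x + y and x - y over the masked pixels.
    This is a heuristic for mildly rotated cards, not a general polygon fit.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    total, diff = xs + ys, xs - ys

    def pick(index):
        return int(xs[index]), int(ys[index])

    return {
        "top_left": pick(np.argmin(total)),      # smallest x + y
        "bottom_right": pick(np.argmax(total)),  # largest x + y
        "top_right": pick(np.argmax(diff)),      # largest x - y
        "bottom_left": pick(np.argmin(diff)),    # smallest x - y
    }


# Synthetic stand-in for a model's predicted mask: an axis-aligned card region.
mask = np.zeros((60, 80), dtype=bool)
mask[20:40, 25:60] = True
corners = corners_from_mask(mask)
```

On real predictions the mask is likely to be ragged, so a more robust fit (for example, approximating the largest contour with a four-sided polygon) would probably be needed.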