A solution for major government identification initiatives

Prepared by Dr. Ben Bavarian, Principal Consultant at ABC Inc., AFIS and Biometrics Industry Experts

Process terabytes of biometric data rapidly using clusters of hundreds to thousands of nodes. Provide large-scale, cloud-based biometric identification system solutions: biometric identification as SaaS (Software as a Service).

It used to be that building large-scale biometric systems and testing new algorithms involved a large group of computers, ad-hoc testing scripts, and a team of system administrators trying to maintain the whole assembly. With the advent of on-demand, cloud-based computing services and frameworks, ABC can now scale testing resources as needed in a reliable, redundant, and robust fashion. Thousands of CPUs can be allocated for a test in order to meet throughput or latency requirements. Multi-terabyte databases are no problem. Processing the statistics for trillions of match scores is easily done.

What’s so difficult about biometric testing?

Quite simply, there is a great deal of data to manage and handle reliably. Tests need to be fully checkpointed, reliable, and redundant. A single hard disk or computer failure could doom an entire test, so the framework that runs the tests needs to be robust and must handle such failures automatically.

The Basic Problem: too much data, too much time

  • Training biometric algorithms requires large training sets. But…
  • Score results grow quadratically: N cases give N × N results (see the sketch after this list).
  • Multiple matching algorithms can be used as input, so again, more data.
  • Multiple feature extraction algorithms each produce their own feature data from input images.
  • The more permutations are involved, the more CPU time and storage are needed.
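
To make the growth concrete, here is a small illustrative calculation (a sketch in Java; the matcher and extractor counts are assumed, and it treats every matcher as running against every extractor's features):

    import java.util.stream.LongStream;

    // Illustrative only: score counts for an all-against-all cross match.
    public class QuadraticGrowth {
        public static void main(String[] args) {
            long matchers = 3;   // assumed number of matching algorithms
            long extractors = 2; // assumed number of feature extractors
            LongStream.of(1_000, 10_000, 100_000).forEach(n -> {
                long scores = n * n * matchers * extractors;
                System.out.printf("%,d cases -> %,d scores%n", n, scores);
            });
            // Doubling the case count quadruples the number of scores.
        }
    }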

How Does ABC’s Approach Solve the Problem?

Rather than relying on a conventional group of networked machines, ABC borrows the technology used by Google, Yahoo, and Amazon.com for search processing. Data and processing are distributed throughout a network of provisioned on-demand computers.

An advanced scheduling system based on Apache Hadoop breaks the calculations into small jobs that are scheduled across a loosely coupled network. A highly reliable filesystem replicates data in at least three places in the network, providing resilience in the face of disk or network failures. Data replication not only helps with reliability but also increases data read throughput, which is important for feature extraction and matching.
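
As an illustration only (not ABC's production code), a feature-extraction step can be written as a Hadoop map task that runs wherever the image data is stored; the extractFeatures() method below is a hypothetical stand-in for a vendor SDK call:

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map task runs against the image records stored in its local HDFS
    // blocks, emitting (record ID, feature template) pairs. Hadoop schedules
    // the task where the data lives and reruns it elsewhere if a node fails.
    public class FeatureExtractMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {

        @Override
        protected void map(Text recordId, BytesWritable image, Context ctx)
                throws IOException, InterruptedException {
            byte[] template = extractFeatures(image.copyBytes());
            ctx.write(recordId, new BytesWritable(template));
        }

        // Hypothetical: a real implementation would call a matching SDK's
        // native feature extractor here.
        private byte[] extractFeatures(byte[] imageBytes) {
            return imageBytes; // identity stub so the sketch compiles
        }
    }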

A column-oriented database is used to store data in a model similar to a NIST record. However, instead of being concentrated in one place in the network (one NIST file = one record), the data within each record is distributed among many different network nodes. Processing nodes that need image data for feature extraction automatically get copies of the image data as the biometric records are added to the system. The network-wide distributed storage in the filesystem and column-oriented database means that biometric record databases can scale from 1 to 100 million records with ease.
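
One natural mapping, sketched below with the standard HBase client API, is one row per subject with one column per capture; the table, column family, and row key names are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Store one biometric record as a single HBase row keyed by subject ID.
    // The underlying HDFS blocks are replicated across the cluster, so the
    // record survives individual node failures.
    public class StoreRecord {
        public static void main(String[] args) throws Exception {
            byte[] fingerImage = new byte[0]; // placeholder image bytes
            byte[] faceImage = new byte[0];   // placeholder image bytes
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("biometric_records"))) {
                Put row = new Put(Bytes.toBytes("subject-000123"));
                row.addColumn(Bytes.toBytes("finger"), Bytes.toBytes("right-index"), fingerImage);
                row.addColumn(Bytes.toBytes("face"), Bytes.toBytes("frontal"), faceImage);
                table.put(row);
            }
        }
    }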

ABC’s pricing structure is based solely on the CPU time used, the data storage needed, and the amount of data transferred in or out of the network. Data transfers in and out of the network use encrypted links. Data is decrypted on the fly during feature extraction, matching, and other operations.
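
As a sketch of what on-the-fly decryption can look like using Java's standard javax.crypto streaming API (the AES/CBC cipher choice and key handling are illustrative assumptions, not a description of ABC's actual scheme):

    import java.io.InputStream;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    // Wraps an encrypted image stream so a feature extractor or matcher
    // reads plaintext in memory only; nothing decrypted is written to disk.
    public class OnTheFlyDecryption {
        public static InputStream decrypting(InputStream encrypted,
                                             byte[] key, byte[] iv) throws Exception {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE,
                        new SecretKeySpec(key, "AES"),
                        new IvParameterSpec(iv));
            return new CipherInputStream(encrypted, cipher);
        }
    }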

How much data?

  • Consider developing a multi-biometric fusion algorithm.
  • Assume for testing that you have 10,000 subjects each with 10 finger captures, 2 iris captures, and a single face capture.
  • Cross match of these subjects generates about 1.3 billion scores: 10,000 × 10,000 subject pairs, times 13 per-pair comparisons (10 fingers + 2 irises + 1 face); the arithmetic is sketched after this list.
  • How do you feed 1.3 billion scores to a fusion algorithm, and then calculate a ROC curve?
  • What about doing that multiple times, for multiple tests, with multiple algorithms?
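
The arithmetic behind that figure, along with the raw storage it implies (assuming 4-byte float scores), is simple but sobering:

    // Score volume for the fusion test above: every subject matched against
    // every subject, 13 per-modality scores per pair (10 fingers + 2 irises
    // + 1 face). Storage assumes 4 bytes per score, ignoring overhead.
    public class ScoreVolume {
        public static void main(String[] args) {
            long subjects = 10_000;
            long scoresPerPair = 10 + 2 + 1;
            long totalScores = subjects * subjects * scoresPerPair;
            System.out.printf("%,d scores%n", totalScores);           // 1,300,000,000
            System.out.printf("%.1f GB raw%n", totalScores * 4 / 1e9); // 5.2 GB
        }
    }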

Other Problems

Computers and networks fail. The testing system has to be robust and must reschedule failed work as necessary.
Purchasing computers for a given test doesn’t make economic sense: computer utilization is low between tests. Computing and storage for testing need to be on-demand.

Internet Search Engine Technology

Amazon, Google, Yahoo, and Facebook all process terabytes of data daily using clusters of hundreds to thousands of nodes. These companies don’t use traditional relational databases or specialized processing nodes; instead, they use clustered groups of commodity computers. This is the back end of so-called “cloud computing.”

  • Failures of nodes, disks, and networks are assumed.
  • Databases are distributed, flat, minimally indexed, and optimized for speed.
  • Control frameworks ensure processing occurs where the data is.

ABC Test Framework and Hadoop

ABC Inc. has developed a biometrics system testing architecture around Apache Hadoop (http://hadoop.apache.org/), open-source software for reliable, scalable, distributed computing. The components and their roles in the ABC testing framework are:

  • MapReduce: Orchestrates large-scale algorithm testing
  • Hadoop Distributed File System (HDFS): Reliable, redundant storage across the cluster
  • HBase: Record-oriented store of case data and score results
  • Pig: Data-flow language for processing
  • Hive: Relational database technology used for advanced SQL-style calculations
  • ZooKeeper: Manages the system and coordinates cluster operations
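
As a deliberately simplified illustration of how these pieces combine, the sketch below submits the feature-extraction mapper from earlier as a map-only MapReduce job; the HDFS input and output paths are hypothetical command-line arguments:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // Submits the feature-extraction mapper sketched earlier as a map-only
    // job; Hadoop handles scheduling, data locality, and retrying failures.
    public class FeatureExtractJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "feature-extract");
            job.setJarByClass(FeatureExtractJob.class);
            job.setMapperClass(FeatureExtractMapper.class);
            job.setNumReduceTasks(0); // map-only: no shuffle or reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BytesWritable.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
            SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }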

Cloud Security and Privacy

One of the best ways to understand the cloud security environment is for cloud service providers to disclose their relevant practices, principles, and capabilities within a common framework. Cloud providers and customers can create a governance framework by leveraging the existing ISO 27001 and ISO 27002 standards to provide an approach that applies naturally in a cloud environment.

Together, ISO 27001 and 27002 provide requirements for creating controls to implement and use security best practices. These practices are specific enough that they can be audited to provide assurance that security design, technologies, and procedures are indeed being implemented in conformance with the standards. More importantly, they provide a common framework for discussing, analyzing and planning how best to leverage cloud offerings to meet business and risk requirements for both the cloud provider and the cloud customer.

ABC writes tests in high-level languages (Java, C++), a scripting language (Pig Latin), or a query language (Hive QL). DLLs and custom libraries can be distributed throughout our network for testing with ease (see the sketch below).
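
For instance, Hadoop's distributed cache can ship a custom matcher library to every node alongside a job; the HDFS path and symlink name below are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.mapreduce.Job;

    // Attach a native matcher library to a test job. Hadoop copies the file
    // to each task's working directory, where the '#' suffix names the
    // local symlink the task code will see.
    public class ShipLibrary {
        public static void attachMatcher(Job job) throws Exception {
            job.addCacheFile(new URI("hdfs:///libs/libmatcher.so#libmatcher.so"));
        }
    }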