Here at WattzOn we use machine learning (ML) to tackle the challenge of automated data extraction from documents. Our machine learning system, GLYNT, has several components, including an Elastic AI Workbench that provides pipelining of documents and data flows, as well as ML algorithm orchestration. The Elastic AI Workbench makes GLYNT a self-contained system that is ready for customer deployments and supports production-scale delivery of results.
In most blog posts and technical notes in the AI or ML fields, almost all of the focus is on ML algorithm selection and refinement. What is neglected is how to drive ML results to scale. Here is a bit of the back story on what we built.
The secret sauce of GLYNT is the way we structured the problem, making it amenable to a machine learning approach. (Heads up: We filed a patent.) At a high level, GLYNT consists of two basic systems:
1. A sophisticated ensemble of machine learning algorithms
2. A pipelining and orchestration system that trains, tunes and runs GLYNT
THE CORE GLYNT
The core GLYNT system ingests PDFs, scans, faxes or images of documents. Documents are pre-processed as needed through one or more OCR engines. All documents then go through the ML data extraction library. The ML suite can combine statistical machine learning ensembles with arbitrarily architected neural network solutions, each running independently or as part of an ensemble. In addition, the two types of solutions can help each other via embeddings that broker information between them, or by contributing separate results to a larger ensemble. The architecture does not restrict which combination is used.
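To make the idea of "separate results in a larger ensemble" concrete, here is a minimal sketch in Python. It assumes each extractor proposes a value for a field along with a confidence score, and it simply picks the value with the highest summed confidence. The function name, the scores and the voting rule are illustrative assumptions, not a description of GLYNT's actual ensemble.

```python
from collections import defaultdict


def combine_candidates(candidates):
    """Pick the value with the highest total confidence across extractors.

    `candidates` is a non-empty list of (value, confidence) pairs,
    one pair per extractor (hypothetical format).
    """
    totals = defaultdict(float)
    for value, confidence in candidates:
        totals[value] += confidence
    # Rank the distinct values by their summed confidence.
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical outputs from three extractors for the "Amount Due" field.
votes = [("$120.50", 0.6), ("$120.50", 0.3), ("$12.05", 0.7)]
winner, score = combine_candidates(votes)
print(winner)
```

Here two weaker votes for the same value outweigh one stronger vote for a different value, which is the basic reason an ensemble can beat its best single member.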
The core GLYNT engine begins operation by training the system to extract particular fields from documents. The trainer simply identifies which fields they want extracted. This is typical data marking, or creation of ground truth. For example, if the desired field is “Amount Due”, a handful of sample documents will be marked with that field. WattzOn has built an easy-to-use data marking tool that allows quick setup of training data and includes features that increase the accuracy of results.
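As a rough illustration of what a piece of ground truth might look like, the sketch below records a marked field as a name, the text on the page, and where it sits. The class names, attributes and coordinate convention are all assumptions made for this example and do not reflect the format used by WattzOn's marking tool.

```python
from dataclasses import dataclass, field


@dataclass
class FieldMark:
    field_name: str  # e.g. "Amount Due"
    text: str        # the value as it appears on the page
    page: int        # 1-based page number
    box: tuple       # (x0, y0, x1, y1) in page coordinates (assumed convention)


@dataclass
class MarkedDocument:
    doc_id: str
    marks: list = field(default_factory=list)

    def add_mark(self, field_name, text, page, box):
        self.marks.append(FieldMark(field_name, text, page, box))

    def fields(self):
        # Map each marked field name to its ground-truth text.
        return {m.field_name: m.text for m in self.marks}


doc = MarkedDocument("bill-001")
doc.add_mark("Amount Due", "$120.50", page=1, box=(410, 88, 470, 102))
print(doc.fields())  # {'Amount Due': '$120.50'}
```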
GLYNT addresses two challenges that plague traditional OCR and ad hoc code solutions and drive ever-growing data extraction costs: changes in the positions of data fields on a page, and new data fields appearing over time. GLYNT was designed to be resilient to changes in presentation, and most layout variations have no effect on GLYNT’s ability to accurately extract data. This resilience is leveraged across all documents with our single ML system.
PIPELINE AND ORCHESTRATION
As we rolled out GLYNT, we soon realized that there was no off-the-shelf pipelining and orchestration system to handle our large, complex ensemble solution. We scoured the market, but in the end found that we had to create our own solution. Since we knew we would be building more ensemble applications, we developed a general production solution hosted on AWS, one ready to run GLYNT and other machine learning applications.
Our production workflow and orchestration system has four key features:
We know our customers have surges in volume and frequently need to run heavy loads through GLYNT. So GLYNT is fully elastic — growing as demand rises, and shrinking as demand abates. It’s capable of supporting high availability via execution in multiple AWS zones.
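One simple way to think about elasticity is as a rule that maps queued work to a worker count, bounded by a floor and a ceiling. The sketch below is a generic illustration of that idea. The function name, the per-worker throughput figure and the limits are assumptions, not GLYNT's actual scaling policy.

```python
import math


def desired_workers(queued_docs, docs_per_worker, min_workers=1, max_workers=50):
    """Return how many workers to run for the current queue depth.

    `docs_per_worker` must be positive. All parameters are hypothetical.
    """
    needed = math.ceil(queued_docs / docs_per_worker)
    # Clamp to the allowed range so the fleet never scales to zero
    # or grows without bound.
    return max(min_workers, min(max_workers, needed))


for queue_depth in (0, 950, 100_000):
    print(queue_depth, desired_workers(queue_depth, docs_per_worker=100))
```

The floor keeps the system responsive when the queue is empty, and the ceiling caps cost during a surge.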
Customers come to us with a variety of technical capabilities. The pipelining system uses SFTP as a simple interface to deliver files to GLYNT’s orchestration system. A separate folder returns results to the user. The pipelining system is separate from general orchestration, and the orchestration layer easily supports a RESTful API.
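The drop-folder pattern described here can be sketched with two local directories standing in for the SFTP upload and results folders. Everything in this example is hypothetical: the folder names, the `process_inbox` helper and the placeholder `extract` callback are assumptions, and no real SFTP transfer takes place.

```python
import json
import tempfile
from pathlib import Path


def process_inbox(inbox, outbox, extract):
    """Run `extract` on every file in `inbox`; write one JSON result per file to `outbox`."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(inbox.iterdir()):
        if not doc.is_file():
            continue
        result = extract(doc)
        target = outbox / (doc.stem + ".json")
        target.write_text(json.dumps(result))
        written.append(target.name)
    return written


# Demo with temporary folders and a stand-in extractor.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    outbox = Path(tmp) / "outbox"
    inbox.mkdir()
    (inbox / "bill-001.pdf").write_bytes(b"%PDF-placeholder")
    names = process_inbox(inbox, outbox, lambda p: {"file": p.name})
    demo_result = json.loads((outbox / names[0]).read_text())

print(names, demo_result)
```

The appeal of this interface is that a customer needs nothing beyond a file transfer client: upload documents to one folder, collect results from another.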
A major source of accuracy and precision in today’s machine learning applications comes from tuning of hyperparameters. Our pipelining and orchestration system allows the customer to indicate the level of tuning required by their application.
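A customer-selected tuning level could, for example, translate into a search budget. The sketch below uses plain random search with a fixed number of trials per level. The level names, the trial counts and the toy scoring function are assumptions for illustration, and random search is a stand-in rather than the tuning method GLYNT uses.

```python
import random

# Hypothetical mapping from tuning level to number of random-search trials.
TUNING_TRIALS = {"none": 0, "light": 10, "thorough": 100}


def tune(score, defaults, search_space, level, seed=0):
    """Random search over `search_space`, keeping the best-scoring parameters."""
    rng = random.Random(seed)
    best_params, best_score = dict(defaults), score(defaults)
    for _ in range(TUNING_TRIALS[level]):
        trial = {name: rng.choice(options) for name, options in search_space.items()}
        trial_score = score(trial)
        if trial_score > best_score:
            best_params, best_score = trial, trial_score
    return best_params, best_score


def toy_score(params):
    # Toy objective: best possible score (0) when depth is 6.
    return -(params["depth"] - 6) ** 2


defaults = {"depth": 2}
space = {"depth": list(range(1, 11))}
print(tune(toy_score, defaults, space, "none"))
print(tune(toy_score, defaults, space, "thorough"))
```

With the level set to "none" the defaults are returned unchanged, while higher levels spend more compute looking for a better setting and can never return a worse score than the defaults.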
In addition, we know that speed can be valuable. Customers that need rapid turnaround at scale can authorize and provision an AWS account that expands compute capacity as needed for prediction, or even for training or tuning of hyperparameters. GLYNT and the pipeline and orchestration systems that support it have fully automated provisioning, which allows custom deployments of the system to fit a customer’s needs, including deployment behind customer firewalls to comply with data governance and security requirements.
Clearly we’re proud of GLYNT. But we’re also very proud of the team that built it, including the Elastic AI Workbench.