Capturing the Long Tail in Documents, Part II

Feb 16, 2021

AI for Documents: Learn the Second Key Question

Imagine a chocolate factory where you can build the chocolate bar of your dreams: Pick your cocoa beans and toppings, and the factory processes them into one delicious chocolate bar ready for you to enjoy. Everyone who uses GLYNT wants to select lots and lots of fields and capture all the data items they have wanted for years, just like kids in a chocolate factory. Since GLYNT is built on ‘Few Shot’ machine learning with on-demand training, getting to that chocolate factory experience is easy. But…

The Challenge

Once GLYNT delivered all of the requested data, our early users realized that lots of documents means lots of document variety, which leads to incredibly jumbled data feeds. Even simple documents, such as healthcare insurance cards and pay stubs, have state-level regulations to meet, creating huge variation in how fields are laid out and what they are called.

API integration engineers were brought to their knees. How could the data be reliable? Every field seemed to have a different format and meaning. Business users screamed in pain. Automation wasn’t working: every document got routed back to the manual data entry team for cleanup. Sound familiar?

The Semantic Layer

To tackle the problem of document variation, we added the Semantic Layer to GLYNT, a built-in method to untangle the variations in documents. Now that the Semantic Layer is in place, data integration specialists get reliable, structured data that slips into their target database or destination.

Business users can see the exact method GLYNT uses to normalize varied documents. No judgements, just scripts. That’s how business users get automation with the transparency they crave.
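To make "just scripts" concrete, here is a minimal sketch of what a transparent normalization step could look like. It is an illustration only, not GLYNT's actual implementation: the alias table, the field names and the helper functions are all hypothetical.

```python
# Hypothetical alias table: label variants as printed on documents,
# each mapped to one canonical field name. Not GLYNT's real schema.
FIELD_ALIASES = {
    "member id": "member_id",
    "member number": "member_id",
    "subscriber id": "member_id",
    "group no": "group_number",
    "group number": "group_number",
}


def clean_label(raw_label):
    """Lowercase a printed label and replace punctuation with spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in raw_label.lower())
    return " ".join(kept.split())


def normalize_record(extracted):
    """Split extracted label/value pairs into canonical and unmapped fields."""
    normalized, unmapped = {}, {}
    for raw_label, value in extracted.items():
        canonical = FIELD_ALIASES.get(clean_label(raw_label))
        if canonical is None:
            # Keep unknown labels visible instead of silently dropping them.
            unmapped[raw_label] = value
        else:
            normalized[canonical] = value
    return normalized, unmapped


# Two made-up insurance cards that label the same field differently.
card_a = {"Member ID:": "A-001", "Group No.": "G-77"}
card_b = {"Subscriber-ID": "B-002", "Plan Type": "PPO"}
```

Both cards end up with the same `member_id` key, and anything the table does not recognize is returned separately so a business user can see exactly what was and was not mapped.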

Categorizing Data

GLYNT has developed a system of data identification and categorization using field-level metatags. We built it from our years of experience in the energy markets, and we’ve found that the same system works on other data-intensive invoices, healthcare records and government documents too.

All of these documents have lots of data, contained in nested tables, and with data context spilling across pages. GLYNT’s Semantic Layer keeps it all together and organized.

An Example of the Semantic Layer

Here is a use case for our Semantic Layer that comes up all the time. GLYNT processes utility bills, and they often have multiple services on the same bill.

For example, in Northern California, PG&E delivers gas and electricity to businesses and homes. So the simple question “What is your account number?” has three answers: the billing account number, the gas account number and the electricity account number.

GLYNT uses metatags to help the user categorize the data we extract. The tags are applied to the data items.
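As a rough sketch of the idea, tagged data items might be modeled like this. The `DataItem` class, the tag names and the account numbers are all made up for illustration; GLYNT's actual tag vocabulary and output format are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """One extracted value plus the tags that say what it refers to."""
    field: str
    value: str
    tags: frozenset


# Placeholder values standing in for one combined gas and electricity bill.
bill_items = [
    DataItem("account_number", "1111111111", frozenset({"billing"})),
    DataItem("account_number", "2222222222", frozenset({"service:gas"})),
    DataItem("account_number", "3333333333", frozenset({"service:electricity"})),
]


def select(items, field, tag):
    """Return the values of every item with the given field name and tag."""
    return [item.value for item in items if item.field == field and tag in item.tags]


gas_account = select(bill_items, "account_number", "service:gas")
```

With tags attached, "What is your account number?" stops being ambiguous: the caller asks for the account number tagged with the service it cares about.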

Delivering Data With Context

In our view, the Semantic Layer for a document type is complete when the user has complete context for the selected data items. To make this clear, let’s look at school transcripts. See how this transcript is transformed into reliable, structured data.

The course data for Spring Semester in Year 1 is great stuff, but one needs the information in the header area to provide context, such as “Which student?” “Which year?” And so on.

So how does an AI system provide context for data? The GLYNT Semantic Layer makes this possible.

Here is a sample of the output for Spring Semester Year 1:

The data from the document is organized so that every course taken has complete context. GLYNT has automated the Semantic Layer and a second tool, the Transformation Layer, to remove document complexity and transform the data from documents into a format your systems can immediately use. You may not want all the data on every transcript, so with GLYNT you can customize what you get.
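One way to picture "every course has complete context" is a flattening step that copies the header information onto each course row. The sketch below assumes a hypothetical nested structure for an extracted transcript; the keys, the student and the courses are placeholders, not real GLYNT output.

```python
def flatten_transcript(transcript):
    """Turn a nested transcript into one flat row per course,
    repeating the header and term context on every row."""
    rows = []
    for term in transcript["terms"]:
        for course in term["courses"]:
            row = dict(transcript["header"])   # Which student?
            row["term"] = term["term"]         # Which semester?
            row["year"] = term["year"]         # Which year?
            row.update(course)                 # The course data itself
            rows.append(row)
    return rows


# Placeholder transcript with made-up values.
transcript = {
    "header": {"student_name": "Example Student", "student_id": "S-0001"},
    "terms": [
        {
            "term": "Spring Semester",
            "year": "Year 1",
            "courses": [
                {"course": "Algebra I", "grade": "A"},
                {"course": "Biology", "grade": "B"},
            ],
        }
    ],
}

rows = flatten_transcript(transcript)
```

Each row can now be loaded into a database table on its own, with no need to look back at the page header to know whose grade it is.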

The Key Question to Ask

The one key question to ask of your documents-to-data vendor is:

“Will the data provided be immediately consumable?”

Anything but an unqualified “yes” in reply will raise your overall cost. Here are a few follow-on questions to help you check the boxes:

Business Question	Others: Top-Down AI	GLYNT: Bottom-Up AI
Can we tell you what fields we want? Or add a field?	No. Fields are set. If you want an additional field we can’t help.	Yes. Just select any field you want.
Can you get data items on any page?	No. We only get the fields we set up from the start.	Sure!
Can we choose the field names?	No. We have a list, or we use what is printed on the document.	Sure!
Can you provide context for data on a table?	Maybe. If it is a standard table on a document we’ve trained on.	Sure!
Can you de-nest table data?	No.	Sure!
Can you do sub-tables of tables?	Maybe. Only if it is a standard sub-table on a document we’ve trained on.	Sure!
How many fields do you get per document?	Typically 12 – 25	As many as you want!
Can you customize the data output to my structured data schema?	No.	Sure!

The Bottom Line

A big shout-out to our customers, because they led us through the chocolate factory back to the refining process, and to the development of the Semantic Layer. It has proven to be incredibly useful to our customers and, importantly, it is a simple but powerful approach that works across industries.

The bottom line is that anything but immediately consumable and reliable data from documents is a big headache and a source of unending costs. Documents have huge variation, even within an industry or document type. Get a solution that knows how to conquer document variety.

Want to give GLYNT a try? Simply reach out for a customized demo on your documents. Let us show you what GLYNT is all about.
