Capturing the Long Tail in Documents, Part I

Jan 25, 2021

AI for Documents: Learn the First Key Question

It was only a few years ago that innovators started including AI-powered features in enterprise software, but it’s been just long enough for patterns to emerge.

One pattern is a bit disturbing: AI often has Diseconomies of Scale. Instead of unit costs falling as more units are produced (Economies of Scale), unit costs rise with the number of units produced. Yikes! This conclusion applies to many AI systems that transform documents into data.

The Challenge of the Long Tail

To understand why AI might have Diseconomies of Scale, let’s take a look at a familiar AI application: chatbots. Consumers pose thousands of questions. How many replies can be automated? Here is what venture capitalists are reporting: about 20%.

The problem is that there is a long tail of infrequently asked questions. It is expensive to catalogue them, collect enough examples to train the AI, organize the content for AI learning, and maintain the AI as questions drift to new types over time. Soon the AI system manager faces increasing costs: it costs 2-3X more to train the AI on a new question than it did to set up the initial frequent questions.

So a responsible chatbot manager makes the cost tradeoff. Some questions are more cheaply answered by AI, some more cheaply answered by humans. That’s how the 20% gets set.
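The tradeoff above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Zipf-shaped volumes, the setup cost, and the human cost per answer are made-up numbers, and the function names are hypothetical. It is not meant to reproduce the 20% figure, only to show how a cutoff appears once volumes thin out in the tail.

```python
def zipf_volumes(top_volume, n_types):
    """Hypothetical long tail: the question type ranked r gets top_volume / r questions."""
    return [top_volume / rank for rank in range(1, n_types + 1)]


def count_worth_automating(volumes, setup_cost, human_cost_per_answer):
    """Count question types where humans answering the full volume would cost
    more than the one-time cost of training the AI on that type."""
    return sum(1 for v in volumes if human_cost_per_answer * v > setup_cost)


# Illustrative numbers only (assumptions, not data from any vendor).
volumes = zipf_volumes(top_volume=10_000, n_types=1_000)
n_automated = count_worth_automating(
    volumes, setup_cost=500.0, human_cost_per_answer=2.0
)
share_of_types = n_automated / len(volumes)
print(f"{n_automated} of {len(volumes)} question types are cheaper to automate")
```

With these assumed numbers, only the most frequent question types clear the bar. Raising the setup cost for rarer types, as described above, pushes the cutoff even earlier.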

Top-Down AI

Documents have the same long tail. Every customer knows that some documents are frequent and regular, but even a simple document such as a driver’s license has a lot of variations: at least one version for every state, then auto, motorcycle and truck licenses, and then… the long tail emerges.

The AI system most frequently used for chatbots and documents is built on a huge library of examples, typically 200,000 or more. For documents, this means the fields are selected, the correct data is marked, and the AI is trained. The system is called Large-Corpus AI (reflecting the size of the training set) or Top-Down AI (reflecting the pre-selection of fixed fields).

Bottom-Up AI

Now consider GLYNT’s Few Shot machine learning system. It trains on just a few sample documents. This means only 3-7 documents are needed in a training set, and any field can be selected and trained. Importantly, the costs of training the first document group are the same as the 100th or 10,000th document group. GLYNT has flat unit costs.

Plus, GLYNT is building out libraries, enabling faster training because we’ve seen the document type before. The library’s assets are document math, not customer data, preserving the privacy of every tenant on our system. Few Shot plus Libraries delivers decreasing unit costs to add a new document type, and delivers Economies of Scale. Because GLYNT uses so few training documents and has field selection flexibility, it is Bottom-Up AI.
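To make the three cost shapes concrete, here is a minimal Python sketch comparing rising, flat, and falling unit costs as document groups are added. The cost functions and every number in them are hypothetical, chosen only to show the shape of each curve; they are not GLYNT's or any other vendor's actual pricing or cost model.

```python
def average_unit_cost(cost_of_nth, n):
    """Average cost per document group after adding n groups."""
    return sum(cost_of_nth(k) for k in range(1, n + 1)) / n


# Hypothetical cost of adding the k-th document group (illustrative units).
def rising_cost(k):
    # Diseconomies of Scale: each new group costs more than the last.
    return 100.0 * (1 + 0.01 * k)


def flat_cost(k):
    # Flat unit costs: the 1st and the 10,000th group cost the same.
    return 100.0


def falling_cost(k):
    # Economies of Scale: prior work makes each new group cheaper.
    return 100.0 / (1 + 0.01 * k)


for n in (10, 100, 1000):
    print(
        n,
        round(average_unit_cost(rising_cost, n), 1),
        round(average_unit_cost(flat_cost, n), 1),
        round(average_unit_cost(falling_cost, n), 1),
    )
```

Only the direction of each column matters here: the first climbs, the second stays put, and the third drops as more document groups are added.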

The Key Question to Ask

Obviously we think you should ask a potential vendor if they are Top-Down AI or Bottom-Up. What might not be as obvious is all the business implications of that single question.

Business Question	Others: Top-Down AI	GLYNT: Bottom-Up AI
Can you automate our long tail of document variety?	No. We persistently route a lot of documents to human teams.	Yes. We’re set up to automate nearly every one of your incoming documents.
What is your accuracy rate?	Very high if the document is frequent. Very low if the document is infrequent. Blended rate of accuracy is middling.	Same very high accuracy for every document type.
What if there is an error?	File a ticket. If we get enough of the same request, we may be able to add the document group to our training system. Wait months.	Re-train GLYNT in minutes.
Does your AI “learn?”	Sort of… If it is a frequently seen document, and we go back and update our very large training set, and we have enough examples… Yes, our system learns.	Yes. You can teach GLYNT with just 1-3 example documents.
Do we get your best AI, the one that aggregates learning from all your customers?	Yes. We’ll do transfer learning and federated learning to bring you the advantages of seeing all the documents. But it is costly to manage such a complex system, so expect higher prices.	Yes. GLYNT shares the math models of documents we’ve seen across customers. This speeds up your training experience, so our efficiency lowers your cost.
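The "blended rate of accuracy" in the table is a volume-weighted average. The short Python sketch below shows the arithmetic; the shares and accuracy figures are assumptions picked for illustration, not measurements of any system, and the function name is hypothetical.

```python
def blended_accuracy(segments):
    """Volume-weighted accuracy over (share_of_documents, accuracy) pairs.

    Shares are expected to sum to 1.
    """
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("document shares must sum to 1")
    return sum(share * accuracy for share, accuracy in segments)


# Illustrative assumption: 60% frequent documents read at 95% accuracy,
# 40% long-tail documents read at 40% accuracy.
top_down = blended_accuracy([(0.6, 0.95), (0.4, 0.40)])
print(f"blended accuracy: {top_down:.2f}")
```

Under these assumed inputs the blended figure lands near 0.73, well below the accuracy on frequent documents, which is why the long tail drags down the headline number.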

The Bottom Line

Considering a document automation solution? Ask your vendors the single key question: Is your AI Top-Down or Bottom-Up? And if they look confused, ask them the list of questions above. You’ll be able to decipher the answers.

As you look for a document solution, remember every aspect of the AI system impacts your ROI:

• More AI coverage of the long tail → Lower costs

• Higher accuracy → Lower costs

• Faster error fixes and training → Easier to maintain and lower costs

• An AI that learns quickly → Easier to maintain and lower costs

Ready to try GLYNT? Contact us for a demo tailored to your documents! We’d be happy to put GLYNT to work.
