Technical Tuesday: Should you process your enterprise documents with LLMs?


As large language model (LLM) capabilities continue to expand, evolving from purely text-based models into multimodal vision language models (VLMs) that can understand images too, business leaders are now asking: ‘Why can't we just process all our documents using ChatGPT or a similar LLM?’

The temptation to rely solely on the latest foundational model for your document solution is understandable—but UiPath research reveals that LLMs on their own won’t give you optimal results. This blog explains why LLMs alone aren’t sufficient for end-to-end document processing, and provides a look at the work UiPath is doing to address the underlying limitations of the LLM- or VLM-led approach. 

The pitfalls of using LLMs for document extraction and processing 

LLMs are a transformative technology with immense business value. However, they have specific limitations that curtail their utility in enterprise-level, intelligent document processing (IDP) and related automations. 

UiPath researchers have spent the last six months stress-testing various LLMs, evaluating hundreds of document use cases while building additional capabilities on top of them. From this research, we’ve identified two major issues: 

1. Dead-ends 

Some problems simply can’t be solved through prompting alone. Whether the issue lies in an LLM’s core extraction capability or in its performance on a specific field, table, or document, we’ve found that no amount of prompting reliably solves these problems at scale. With an exclusively prompt-based solution, this leads to frustrating experiences we describe as ‘dead ends.’ Throughout our model stress-testing, we’ve encountered multiple dead ends and are implementing different strategies to remove them:

Table extraction is one area where we've seen significant problems. Foundational LLMs frequently make mistakes like skipping rows, mixing up columns and rows, or inventing values where the data is simply missing.

Fortunately, we’ve been able to solve this problem through our new model for complex and unstructured documents in UiPath IXP. The solution is built around intelligent pre-processing, plus a combination of different models, including UiPath computer vision models, LLMs for table recognition, and LLMs for extraction. This ability to determine the right combination of technologies for a given use case yields a powerful advantage for enterprises compared with just using an LLM directly.  
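As a rough sketch of what combining models can look like in practice (illustrative only, not the actual UiPath IXP implementation), the pipeline below runs a computer vision step to locate tables before any table-specific extraction, rather than asking a single model to read the page end to end:

```python
# Illustrative only: a multi-model extraction pipeline. The callables passed in
# (a table-detection model, a general LLM extractor, a table-specialised extractor)
# are hypothetical stand-ins, not UiPath APIs.
from typing import Any, Callable

def extract_document(
    document: bytes,
    detect_tables: Callable[[bytes], list[Any]],        # e.g. a computer vision model
    extract_fields: Callable[[bytes], dict],            # LLM prompt for free-form fields
    extract_table: Callable[[bytes, Any], list[dict]],  # model specialised for tables
) -> dict:
    """Combine pre-processing and several models instead of one end-to-end LLM prompt."""
    result = extract_fields(document)        # 1. LLM handles scalar fields
    result["tables"] = []
    for region in detect_tables(document):   # 2. CV model finds table regions first
        # 3. A table-aware model reads rows and columns within the detected region,
        #    reducing skipped rows and column/row mix-ups.
        result["tables"].append(extract_table(document, region))
    return result
```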

To illustrate, here’s a summary of performance across a sample of challenging document types: 

Average Field F1 Score

Document Type                | Image Only | Enhanced Table Extraction V1 | Enhanced Table Extraction V2
Paystubs/pay slips           | 73.4%      | 79.7%                        | 85.4%
Insurance renewal documents  | 90.4%      | 95.1%                        | 98.5%

Our approach results in a significant improvement, allowing customers to accelerate their time to production and start extracting value more quickly.  

In our experimentation, we’ve experienced similar issues with document elements like checkboxes and signatures. Improving extraction accuracy and performance for these tasks is a key priority for our team. 

2. Automation Readiness 

With simple documents and just one or a small number of samples, it's very easy for an LLM to get an accurate extraction—at first glance. However, seeing fields extracted in a chat experience is not sufficient for automation. You need to drive consistency in the structured schema that's extracted, and you need data typing to ensure the data is in the correct format for your automation.  
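As a minimal sketch of what ‘automation ready’ means in practice, a typed schema (using pydantic here purely for illustration; it is not part of UiPath IXP, and the field names are invented) can turn a raw LLM response into data an automation can safely consume:

```python
# Illustrative only: enforce a typed schema on raw LLM output before it enters a workflow.
from datetime import date
from decimal import Decimal
from pydantic import BaseModel, ValidationError  # pydantic v2 syntax

class Invoice(BaseModel):
    invoice_number: str
    issue_date: date   # ISO strings like "2024-03-01" are coerced into real dates
    total: Decimal     # malformed amounts fail here instead of reaching downstream systems

raw = '{"invoice_number": "INV-001", "issue_date": "2024-03-01", "total": "1024.50"}'

try:
    invoice = Invoice.model_validate_json(raw)
except ValidationError as err:
    # Schema or type mismatch: route to human review rather than automating on bad data.
    print(err)
```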

This is where the motivation to create UiPath IXP (Intelligent Xtraction & Processing) began. We wanted to provide a simple experience for users to construct automation-ready schemas and prompts, and quickly iterate to drive improvements in extraction. However, to actually use these within the automation of a business process and at scale, you need even more.  

Two other critically important capabilities are attribution and confidences:

Attribution 

Attribution, or references/citations, is key when automating a process. Attribution is how a model shows a user where in the document a given field was extracted from. Because human review remains necessary when automating important documents at scale, it’s critical to provide clear attribution for model outputs. Without it, a user would need to manually review the entire document, which defeats the purpose of automation.

Foundational LLMs do not provide reliable attribution and have been shown to erroneously invent citations in many cases. A lack of reliable attribution limits the utility of LLMs in document processing. UiPath IXP overcomes this challenge as attribution for foundation models has been built into the core experience. It’s also a part of the experience we’re continually iterating and improving on. 
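To make the idea concrete, here is one simple way (an illustration, not the UiPath IXP data model) to carry attribution alongside each extracted value so a reviewer can jump straight to the evidence:

```python
# Illustrative only: an extraction result that records where each value came from.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attribution:
    page: int
    char_start: int
    char_end: int   # span in the OCR text the value was read from

@dataclass
class ExtractedField:
    name: str
    value: str
    attribution: Optional[Attribution]   # None means no evidence found: flag for review

field = ExtractedField(
    name="total_amount",
    value="1,024.50",
    attribution=Attribution(page=2, char_start=1843, char_end=1851),
)
```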

Confidences 

Confidences—the degree to which a person can trust a specific output from a model—are critical for ensuring human-in-the-loop review is both effective and efficient. Users need a simple metric to help them decide whether a model output should be reviewed by a human. The only alternative is manually reviewing everything.  

Some foundational models output token log probabilities (logprobs), which can be used to derive confidences. However, while directionally informative, they don’t provide the beautifully smooth precision-recall curves typical of traditional models.
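As a rough sketch of that standard logprob approach (the aggregation choice and threshold below are assumptions, not a UiPath recommendation), a field-level confidence can be derived from the log probabilities of the tokens that make up an extracted value:

```python
# Illustrative only: turn per-token log probabilities into a field-level confidence.
import math

def field_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean probability of the tokens that make up one extracted value."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example logprobs for the tokens of one extracted value, e.g. "1,024.50".
logprobs = [-0.02, -0.15, -0.01, -0.4]
confidence = field_confidence(logprobs)
needs_review = confidence < 0.9   # route low-confidence fields to a human reviewer
```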

Our main takeaway is that the standard logprob approach isn’t ideal. It's possible to make more requests to get more useful confidences that are better at distinguishing between good and bad predictions, but this is significantly more resource-intensive. While confidences are important, users also need to perform tasks like building validation rules on top of extractions—such as applying mathematical checks and regular expressions—to ensure outputs are accurate and automation-ready. We’re currently building features into UiPath IXP to make these a core part of the experience, while also developing better ways to provide more useful confidences.  
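For illustration, validation rules of the kind mentioned above (the field names and formats here are invented) might look like this:

```python
# Illustrative only: post-extraction validation rules combining a regex format check
# with an arithmetic cross-check between fields.
import re
from decimal import Decimal

def validate(extraction: dict) -> list[str]:
    errors = []
    # Format rule: invoice numbers must look like "INV-" followed by digits.
    if not re.fullmatch(r"INV-\d{3,}", extraction.get("invoice_number", "")):
        errors.append("invoice_number has an unexpected format")
    # Mathematical rule: line items must sum to the extracted total.
    line_total = sum(Decimal(item["amount"]) for item in extraction.get("line_items", []))
    if line_total != Decimal(extraction.get("total", "0")):
        errors.append(f"line items sum to {line_total}, but total is {extraction.get('total')}")
    return errors   # a non-empty result sends the document to human review
```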

In-context/few-shot learning 

Another technique we’ve explored: using annotated examples in model prompts to improve performance—an approach known as ‘in-context’ or ‘few-shot’ learning. However, to date, the results have been mixed. Where the content is generally short (like emails), we’ve seen that if you provide relevant examples, you can significantly improve performance. We’ve also found that in-context learning is effective for making targeted improvements, like when there are extraction issues on a specific table in a complex document.  
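For illustration, a few-shot extraction prompt for short content like emails might look like the sketch below (the fields, wording, and example here are invented):

```python
# Illustrative only: include one annotated example before the target email
# so the model sees the expected output format and behavior (few-shot learning).
EXAMPLE_EMAIL = "Hi, please ship order 4821 to 10 Main St by Friday. Thanks, Dana"
EXAMPLE_ANSWER = '{"order_id": "4821", "ship_to": "10 Main St", "deadline": "Friday"}'

def build_prompt(target_email: str) -> str:
    return (
        "Extract order_id, ship_to and deadline as JSON.\n\n"
        f"Email:\n{EXAMPLE_EMAIL}\nAnswer:\n{EXAMPLE_ANSWER}\n\n"  # annotated example
        f"Email:\n{target_email}\nAnswer:\n"                        # target to extract
    )
```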

However, in-context learning also has its drawbacks. First, it can lead to ‘overfitting,’ where the model applies the same behavior to similar tables in the document. Second, we haven’t yet seen strong results on longer document types.

The wider research community has found that in-context learning with VLMs doesn’t yield results comparable to those seen with text-based models, but our research here continues.

A roadmap for end-to-end document automation 

UiPath uses foundational LLMs extensively in its intelligent document processing (IDP) capabilities, whether behind the scenes to accelerate customer time to value, or up front as the model for extraction. We’re continually building our own solutions on top of state-of-the-art technologies specialized for targeted use cases, while providing support and access to the latest frontier models.

However, we don’t assume LLMs alone will solve the many challenges of document processing. LLMs aren’t trained specifically on enterprise documents, or for the many different use cases required by our customers. Furthermore, LLM providers aren't prioritizing the improvement of attribution or confidences in document processing. That’s why UiPath researchers and engineers have developed a multi-pronged strategy to tackle these issues: 

  • Train and extend our own fine-tunable models for complex and unstructured documents; 

  • Improve our pre- and post-processing capabilities to supplement foundational models; and 

  • Continuously evaluate additional LLMs and make recommendations for different use cases. 

We’ve already made good progress on our first objective and will announce more advances in the near future. In addition, we’ve taken significant strides in our pre- and post-processing capabilities, which will remain a key area of research and investment. Next up, we’ll address performance issues when extracting checkboxes and signatures. We’ll also continue our efforts to improve attribution for LLMs, following an iterative process in which we find edge cases and release improvements over time.

Lastly, we’re making it easier for UiPath users to bring in more models for their specialized document processing tasks, while giving them a holistic experience to quickly evaluate models’ performance and scale them. Our approach will help our users adapt to changing conditions while also providing a mechanism to try alternate models for specific use cases. 

Ultimately, we'll need all three of these strategies to deliver more accurate and reliable extraction across all document and content types. Our goal is to combine methods and technologies to provide the best performance for our customers and their unique use cases. UiPath IXP will evolve to intelligently orchestrate and combine multiple models—selecting the most effective approach for each document processing task—while incorporating the latest advances in pre- and post-processing. 

Want to learn more about our open and flexible approach to document processing? Visit our capability page and begin your trial of our new generative extraction capability for complex and unstructured documents.

George Barnett

Senior Director, Product Management, IXP, UiPath
