Sharding Big Data in UiPath Process Mining

How do you handle large data volumes in UiPath Process Mining? Nowadays, all data analytics activities face the same challenge: handling big data. Several trends over the last decades have only made this problem worse.

First, the amount of data gathered is immense. More than 90% of all the data ever created was generated in the last few years alone. It is mind-boggling to imagine how much data this actually is!

The way we handle big data has changed as well. A decade ago, a data analyst could spend hours configuring a data-mining algorithm or writing queries to a reporting system. The data analyst would press a button to execute the query and would wait minutes, sometimes hours, for the answer to their question.

Today, that way of working no longer holds. Like many other data analytics techniques, process mining is moving towards a more business-oriented audience. This transition really changes the game.

Business users want an easy-to-use tool which gives them relevant insights, fast. They expect a user experience similar to what they know from their smartphones. So, instead of waiting minutes, they want results in seconds.

How does data affect speed?

UiPath enables you to make ‘governed self-service’ process mining applications. What exactly does this mean?

UiPath Process Mining gives business users a contained information space that is very easy to use. Users get the insights they need to optimize their business processes. However, the speed of such an application must be fast enough to keep these users engaged.

Our software developers at UiPath love performance. They are continually making step-by-step improvements to the overall speed of the UiPath Process Mining product. But does it only depend on their efforts and extra hours spent in the office?

No, there are many other factors that determine the speed of process mining. The following is a very simple rule of thumb that applies to all data analytics tools: “The more data you put in an application, the slower it gets.”

Performance scales with the number of records used in the app. It’s the number of records in your largest dataset that has the biggest impact on performance. In process mining, that usually is the event log itself. Remember, it’s a very simple trade-off. The more data you put in, the slower it gets.

How to improve performance?

One solution is to reduce the amount of data records that are loaded. For example, you could limit the time-period from ten years to only one year. However, that’s not always desirable.

And what if you want to load a drastically higher number of data records: say, ten times, 100 times, or maybe 1,000 times as much?

Sounds impossible? The innovative UiPath solution to this problem is called “sharding.”

What is sharding?

With sharding, you divide the original dataset into multiple shards. The smaller each shard is, the faster each shard will be. When a user logs in, the corresponding data shard will be loaded.

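A typical unit for sharding would be “company code” or “department.” For example, if you have 50 company codes, each shard will contain one company code. If the company codes are of similar size, each shard holds about one-fiftieth of the records and, by the rule of thumb above, can be roughly 50 times faster than the original dataset.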
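The splitting step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how UiPath Process Mining implements it: the record layout, the `company_code` field name, and the `shard_event_log` function are all assumptions made for the example.

```python
from collections import defaultdict

def shard_event_log(events, shard_key="company_code"):
    """Group event records into shards, one shard per distinct value of `shard_key`."""
    shards = defaultdict(list)
    for event in events:
        shards[event[shard_key]].append(event)
    return dict(shards)

# Hypothetical event log: one dict per event (field names are illustrative).
event_log = [
    {"case_id": "A1", "activity": "Create order", "company_code": "1000"},
    {"case_id": "A1", "activity": "Approve order", "company_code": "1000"},
    {"case_id": "B7", "activity": "Create order", "company_code": "2000"},
]

shards = shard_event_log(event_log)
for code, shard in sorted(shards.items()):
    print(code, len(shard), "events")
```

Every shard keeps the same record structure as the original event log, which is why a single application can serve all of them.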

User management can be isolated per shard, such that users can be managed separately. Using the Process Mining User Sync functionality, information about who belongs to which shard can be loaded automatically without extra configuration for each new user.
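As a rough sketch of what "loading the corresponding shard" means at login, the lookup below maps a user to a shard key and returns only that shard's data. It does not show the Process Mining User Sync functionality itself; `USER_SHARDS`, `SHARDS`, and `shard_for_user` are hypothetical names, and the user-to-shard mapping is hard-coded here only for illustration.

```python
# Hypothetical user -> shard assignment. In a real deployment this information
# would come from user synchronization, not from a hard-coded dict.
USER_SHARDS = {
    "alice@example.com": "1000",
    "bob@example.com": "2000",
}

# Hypothetical shards, keyed by company code.
SHARDS = {
    "1000": [{"case_id": "A1", "activity": "Create order"}],
    "2000": [{"case_id": "B7", "activity": "Create order"}],
}

def shard_for_user(user, user_shards, shards):
    """Return the events of the shard assigned to `user`.

    Raises PermissionError when the user has no shard assigned.
    """
    if user not in user_shards:
        raise PermissionError(f"no shard assigned to {user}")
    return shards.get(user_shards[user], [])

print(shard_for_user("alice@example.com", USER_SHARDS, SHARDS))
```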

Development is easy because you only have to develop, maintain, and deploy one single application. It can be used for all shards, because the data structure of each shard is the same.

Now, you might be wondering: what if I want to compare all my company codes? Is that still possible with sharding? 

Benchmark shards

While sharding vastly improves performance per shard, you lose the ability to compare over shards. To get that overview back, Process Mining has “benchmark shards” that combine the data of multiple shards into one benchmark.

To make sure the benchmark shard performs better than the original dataset, we must somehow reduce the data per shard. There are multiple ways to do that.

1. Pre-aggregation

We can pre-aggregate values over shards, or over any other attribute in the dataset. You can no longer perform every detailed analysis, but you are still able to compare differences across shards.
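A minimal sketch of pre-aggregation, assuming each case record carries a `company_code` and a `throughput_days` value (both field names are made up for this example): the benchmark keeps one summary row per shard instead of every case.

```python
from collections import defaultdict

def pre_aggregate(cases):
    """Reduce case records to one summary row per company code."""
    totals = defaultdict(lambda: {"cases": 0, "total_days": 0.0})
    for case in cases:
        agg = totals[case["company_code"]]
        agg["cases"] += 1
        agg["total_days"] += case["throughput_days"]
    return {
        code: {
            "cases": agg["cases"],
            "avg_throughput_days": agg["total_days"] / agg["cases"],
        }
        for code, agg in totals.items()
    }

# Hypothetical case-level data from two shards.
cases = [
    {"case_id": "A1", "company_code": "1000", "throughput_days": 4.0},
    {"case_id": "A2", "company_code": "1000", "throughput_days": 6.0},
    {"case_id": "B7", "company_code": "2000", "throughput_days": 3.0},
]

benchmark = pre_aggregate(cases)
print(benchmark)
```

The summary is enough to compare company codes side by side, but the individual cases can no longer be inspected from the benchmark alone.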

2. Lower level of detail

With Process Mining, a typical benchmark shard removes levels of detail in the events. We can filter out all fine-grained events, and only keep the high-level events. This enables you to compare processes on a coarse level.
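Filtering to a lower level of detail could look like the sketch below. Which activities count as "high-level" is a modeling decision; the `HIGH_LEVEL_ACTIVITIES` set and the activity names are assumptions chosen purely for illustration.

```python
# Hypothetical set of activities considered high-level for the benchmark.
HIGH_LEVEL_ACTIVITIES = {"Create order", "Ship goods", "Receive payment"}

def coarse_events(events, keep=HIGH_LEVEL_ACTIVITIES):
    """Keep only the events whose activity is in the high-level set."""
    return [event for event in events if event["activity"] in keep]

# Hypothetical event log mixing coarse and fine-grained steps.
event_log = [
    {"case_id": "A1", "activity": "Create order"},
    {"case_id": "A1", "activity": "Change price"},
    {"case_id": "A1", "activity": "Change delivery date"},
    {"case_id": "A1", "activity": "Ship goods"},
    {"case_id": "A1", "activity": "Receive payment"},
]

benchmark_events = coarse_events(event_log)
print([event["activity"] for event in benchmark_events])
```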

3. Tagging

The unique ability to tag interesting situations in Process Mining works like a charm in combination with benchmark shards. You can even remove all event data and keep the tags of their respective cases. This makes it easy to compare tags over multiple shards.
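To illustrate keeping only the tags, the sketch below drops all event data and counts tags per shard. The case layout, the `tags` field, and the tag names are hypothetical and do not reflect how tags are stored in the product.

```python
from collections import Counter, defaultdict

def tag_counts_per_shard(cases):
    """Count how often each tag occurs per company code, ignoring event data."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case["company_code"]].update(case["tags"])
    return {code: dict(counter) for code, counter in counts.items()}

# Hypothetical tagged cases; events have already been removed.
cases = [
    {"case_id": "A1", "company_code": "1000", "tags": ["Late payment"]},
    {"case_id": "A2", "company_code": "1000", "tags": ["Late payment", "Price change"]},
    {"case_id": "B7", "company_code": "2000", "tags": ["Price change"]},
]

tag_benchmark = tag_counts_per_shard(cases)
print(tag_benchmark)
```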

Combining shards

The combination of a benchmark shard and many normal shards gives you the best of both worlds: a high-level overview to compare shards, and the possibility to zoom in to a specific shard to see all the fine-grained details available.

UiPath Process Mining gives your business users a great user experience by switching seamlessly from the benchmark shard to a specific shard and back. High-level management can see the overall picture, while you can still zoom in to all the details. And the cherry on the cake: all of this can be done at great speed!


Learn more about UiPath Process Mining on our Academy


Martijn Wijffelaars

Product Management Director, UiPath