
What this blog covers:

  • How unstructured data is rising in volume, and the ways AI helps capture it.
  • The value trapped in unstructured data for deeper analytics.
  • What makes modern OLAP the best solution for handling unstructured data.

I know of many cases where an IT manager wakes up in the morning to find that their data resource needs have grown by an order of magnitude or more overnight. Causes range from merging with a much more data-intense company, to a smallish company landing the “McDonald’s” or “Walmart” of its industry, to deploying a grid of IoT sensors, to capturing data at a lower level of granularity.

More recently, another reason for a 10x rise forced its way onto the data scene: the unprecedented AI wave generated by the hype around ChatGPT. Until now, in my world of business intelligence (BI) customers, the notion of AI has usually been a matter of “nice to have” or “we’re not there yet”. But I believe the hype around ChatGPT will instigate an imperative for adoption. It’s forcing all of my data and analytics colleagues to rethink their data use cases and infrastructures.

What does it mean when resource requirements suddenly rise 10x in terms of compute, along with the consequent rise in cost? We look for optimization solutions. For this article, the optimization solution is an old friend – highly scalable, pre-aggregated OLAP cubes, which took a back seat during a past decade focused on the grand scalability of Big Data and the Cloud.

In this article, I will first discuss how the 10x increase in volume comes from the immense amount of data currently trapped in unstructured formats and now unlocked by AI. I will then discuss the value of that unlocked data in terms of BI. Finally, I will discuss the re-introduction of highly scalable, pre-aggregated OLAP into the mainstream of BI.

The Unstructured Data Majority

When I ask a customer how much data they have, they say something along the lines of: “About 20 terabytes, but most of it is images and PDFs.” Traditionally, the data loaded into BI data stores has been limited to a minority of an enterprise’s data, at least in terms of volume.

Although it varies, it’s often said that around 80% of an enterprise’s data volume is unstructured. Examples include images, videos, audio, and text documents (e.g., PDFs, Word documents, and Web pages). But that ratio of unstructured to structured will probably grow as producing and storing large objects, especially videos, becomes ever more technically and financially feasible.

BI systems originally included only structured data, such as that from relational databases. More recently, semi-structured data, particularly JSON, has been pretty much baked into the BI data mix. Such semi-structured data is usually about putting data in an easily interchangeable, readable format (by humans and major software applications) with a flexible schema. Prior to feasible AI, unstructured data was primarily only human-readable, so the information trapped in it was mostly left there.

However, unstructured data not only dominates in terms of sheer volume; we can sense that its tide is rising. For example, due to Covid-induced remote work and social media, our video data assets are growing at a tremendous rate. Through the big shift toward remote work, we capture many gigabytes of recorded sales demos (as both vendor and customer), team meetings, and brown-bag tutorials. We offer marketing and tutorial videos on social media.

An army of customer-facing knowledge workers, armed with an arsenal of multimedia recording options and interacting with vast populations of customers, today generates troves more data than a traditional series of questions and terse answers on a form can capture.

The key to unlocking the information is the readily accessible AI engineering platforms (e.g., Azure Cognitive Services) that are able to crack open the data typically trapped in an enterprise’s unstructured data. Before readily usable AI tools emerged, the data locked in unstructured formats could only be mined through extremely laborious and error-prone human data entry – or, for the adventurous, through expensive and difficult technologies that made it seem not worth the effort and expense.

This reminds me of how cooking made far more nutrients available to our distant ancestors, both from the same food and from food that isn’t edible raw. The sudden abundance of nutrition freed us to explore aspects of life beyond gathering food.

The Value of this Unstructured Data

Videos, images, and audio capture what we see and hear. Vision and hearing are the two senses we depend upon the most. Yet we only scratch the surface of the information embedded in those sights and sounds.

Since we do go through the time and cost of storing unstructured data, it’s obviously valuable. But that value is at best trapped in the heads of the experts who read, digest, and archive it. For example, legal documents are certainly filled with valuable enterprise data, but it’s mostly trapped in the heads of the corporate attorneys. Eventually, because the human attorney moves on to the next issue and one human brain can only memorize so much, the value within the totality of documents is forgotten, lost in dusty cold-storage tiers.

Think about a recorded sales demo over Teams/Zoom. Say it’s about an hour long and attended by around six people. Many requirements were communicated, along with many capabilities. Many questions were asked and answered. Many emotions and feelings, conscious or subconscious, were conveyed. Such details and context are incorporated into decisions and actions only by the few who attended or watched later. None of it appears in any BI dashboard.

Surely others on both the vendor and customer side who could not attend would find value in consuming the video. But an hour is a long time. Even at 1.5x speed, that’s still 40 minutes. To save time, some might skip over the introduction or not pay attention at certain points. But the introduction might contain an important comment by the customer. Or what appears to be a topic on costs that an engineer might not pay attention to could hint at a make-or-break concern.

Every day for years, I have sent and received at least a few business-related photos on my phone from colleagues. They include screenshots of some sort of error message, a messy dashboard someone likes or doesn’t like, or a conference room being set up. These are scenes too cumbersome to describe in a text or type into a form.

Text and information extrapolated from relative positions in an image or video scene highlight relationships between the various identified text and/or objects. For example, an image of a PowerPoint slide in a sales demo could be recognized and its information captured. Perhaps it’s a slide of the titles and organizations of attendees. Or images may contain collections of things (“baskets”) seen in other images, across time and locations.

All of these data points would form dimensions that span a library of images and videos: for example, correlations across sets of people and/or things, mentions of key words, or positive and negative sentiments about all manner of things.
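To make the idea concrete, here is a minimal sketch of how such extracted annotations could be flattened into dimensional rows a BI tool could slice. The schema (video, scene, speaker role, keyword, sentiment score) and the sample values are hypothetical illustrations, not output from any real extraction service:

```python
from collections import defaultdict

# Hypothetical annotations an AI pipeline might extract from a library
# of demo videos, flattened into fact rows with dimension columns.
annotations = [
    # (video_id, scene, speaker_role, keyword, sentiment)
    ("demo_01", 1, "customer", "cost", -1),
    ("demo_01", 2, "vendor", "performance", 1),
    ("demo_02", 1, "customer", "cost", 1),
    ("demo_02", 3, "customer", "security", -1),
]

# Slice along the "keyword" dimension: average sentiment per topic
# across the whole video library.
totals = defaultdict(lambda: [0, 0])  # keyword -> [sentiment sum, count]
for _, _, _, keyword, sentiment in annotations:
    totals[keyword][0] += sentiment
    totals[keyword][1] += 1

avg_sentiment = {k: s / n for k, (s, n) in totals.items()}
print(avg_sentiment)
```

The same rows could just as easily be grouped by speaker role or by video, which is exactly what makes them dimensions rather than free text.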

How much Information is Trapped in these Unstructured Sources?

Depending on the picture and the nature of our business, the old cliché about a picture being worth a thousand words might be a good start for a high estimate. Further, a video is composed of hundreds to thousands of pictures, so that’s a few orders of magnitude more “words”.

In reality, the data points extracted from raw unstructured data are still far less voluminous. But there are potentially so many things that could be said. The extraction and annotation of information mined from unstructured sources could take us from tens of KB or MB of sales transactions per customer to tens of MB or GB per customer of highly detailed, sequenced data.

Considering the same sales demo video from above, what can we extract from it? To start, there is the transcript, which is generated through a type of AI. The transcription even recognizes who said what and, of course, in what order things were said. The transcript is still mostly unstructured text, but another type of AI could extract key words and key phrases, and even perform sentiment analysis.
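The steps above can be sketched in miniature. This is a toy illustration only: real pipelines would use a speech-to-text service with speaker diarization (such as Azure Cognitive Services) and trained NLP models, whereas here the transcript is hard-coded and the keyword and sentiment logic are crude word-list stand-ins:

```python
from collections import Counter

# Toy transcript: (speaker, utterance) pairs, shaped like what a
# speech-to-text service with speaker recognition might return.
transcript = [
    ("Alice (customer)", "We love the dashboard but the cost worries us"),
    ("Bob (vendor)", "The pre-aggregation layer keeps compute cost low"),
    ("Alice (customer)", "Great, low cost and fast queries are exactly what we need"),
]

STOPWORDS = {"we", "the", "but", "us", "and", "are", "what", "is", "a"}
POSITIVE = {"love", "great", "fast", "low"}
NEGATIVE = {"worries", "slow", "expensive"}

def keywords(text, top_n=3):
    """Crude keyword extraction: most frequent non-stopwords."""
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

def sentiment(text):
    """Crude lexicon-based sentiment: positive hits minus negative hits."""
    words = {w.strip(".,").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

for speaker, utterance in transcript:
    print(speaker, keywords(utterance), sentiment(utterance))
```

Even this toy version shows the shape of the output: per-speaker, per-utterance rows of structured data points ready to be loaded into a BI store.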

Questions by attendees could be recognized and automatically linked to an answer in a knowledge base or passed on to something like ChatGPT.

Based on the conversation, the video could be automatically broken into chapters or “scenes”. For each scene, we could derive a collection of recognized objects and even an automatically generated subject description.

A more sophisticated data extraction would be to analyze facial expressions or even the position and movement of the pupils. That’s admittedly a bit creepy. But think about how much information our brains pick up from subtle actions and body language that isn’t available to analysts without sitting for hours watching videos.

Think about video of the floor of a manufacturing plant or a hospital ward. Currently, data used for optimization is mostly limited to device and equipment signals (in an Internet of Things way). Video can capture many things that happen that we would probably never have thought to measure. But practically all the physical data captured in the video is left out of analysis, trapped in the video jail.

Objects in PowerPoint slides presented over Teams/Zoom (i.e., when we don’t have the PPT file) could be recognized and analyzed. For example, brand logos and other objects could provide clues as to context. Even a summary of the slide as a whole might be extracted.

The Implications of Cracking Unstructured Data

The implication of what could be a 10x rise in data volume available to BI is a corresponding increase in query time and compute. An intense query today, say over 10 billion rows, computes in a reasonable time; that probably won’t be the case with around 10x as many rows.

In the realm of data and analytics, we’re not talking about just storing a magnitude more data. That’s the easy and relatively inexpensive part. Querying a magnitude more data is another story. Analytics (OLAP) queries return results computed from large volumes of data. That is in contrast to transactional (OLTP) applications, where we might request the data of one customer at a time. Those OLTP queries involve just a few KB each, no matter how many customers there might be in the database.

At the time of this writing, an analytics query such as “what are the top 20 products by sales in the US and Asia for non-food items” may cover, say, 10 to 20 billion rows. The current generation of Cloud data warehouse platforms such as Redshift and Snowflake is capable of handling that load fairly well. That is, within the order of tens of seconds – with just a few concurrent users.

However, such intense queries covering a large number of rows probably constitute a minority of queries, while the vast majority are more targeted and so take a second or two. Therefore, most users would find this reasonably acceptable.

But what if, instead of 10 or 20 billion rows, there are now 100 billion to over a trillion? Those occasional queries that once took tens of seconds might now take a few minutes. And some percentage of the queries that once took one or two seconds might take a few tens of seconds. Those timeframes might now be too tedious for the analyst.

During most of the 2000s, the solution to accelerating analytics queries was to implement pre-aggregated OLAP cubes such as those implemented through SQL Server Analysis Services (Multidimensional) or Essbase. Using those pre-aggregated OLAP cubes, an analytics query would take less than a second to a few seconds – whether the result was computed from one fact or a billion facts (a lot of rows circa 2000–2010).
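The reason pre-aggregation is so fast is easy to see in miniature: one pass over the raw facts builds a tiny summary keyed by the dimensions queries group by, and every query afterward reads a handful of cells instead of scanning the fact table. This is an in-memory sketch of the idea, not how any particular OLAP engine is implemented; the regions, categories, and row counts are made up:

```python
from collections import defaultdict
from itertools import product
import random

random.seed(42)

# A toy "fact table" of raw sales rows: (region, category, amount).
regions, categories = ["US", "Asia", "EU"], ["food", "electronics", "apparel"]
facts = [(random.choice(regions), random.choice(categories), random.uniform(1, 100))
         for _ in range(100_000)]

# Pre-aggregate once: a single pass over the raw rows builds a tiny
# cube keyed by (region, category), the dimensions queries group by.
cube = defaultdict(float)
for region, category, amount in facts:
    cube[(region, category)] += amount

# A query like "sales of non-food items in the US and Asia" now reads
# a handful of cube cells instead of scanning 100,000 raw rows.
answer = sum(cube[(r, c)]
             for r, c in product(["US", "Asia"], ["electronics", "apparel"])
             if (r, c) in cube)
print(f"{answer:,.2f} from {len(cube)} pre-aggregated cells")
```

The cube here has at most nine cells regardless of whether the fact table holds a hundred thousand rows or a hundred billion, which is exactly the property that made cubes sub-second circa 2000.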

The downside for an IT manager is the introduction of yet another technology, the pre-aggregated OLAP cube. Every technology employed by an IT department requires an ongoing level of expertise to develop products on it and support it. IT managers would need to ensure the skillset exists for the entire life of the technology’s use.

As mentioned, the state of performance today is such that the “normal” amount of enterprise data is serviced adequately well by current Cloud data warehouse platforms. An IT manager would weigh the trade-off between good-enough alternatives built into their Cloud DW platform (e.g., materialized views, caching) and introducing another vendor and technology.

However, the magnitude increase in volume, courtesy of the release of information from unstructured data, seems to have reached a point where a few weights can be placed on the scale in favor of the once venerable technology of pre-aggregated OLAP cubes.

But the length of query time is just one side of the coin. Distributed, scale-out Cloud database technologies enable the highly performant processing of massive volumes of data. However, scale-out isn’t a free lunch. It’s very expensive. The Cloud promises to scale to great heights, but it’s going to cost you.

Many prominent Cloud data warehouse platforms charge by compute time. In other words, computing over 100 billion raw rows costs more than computing over a few hundred or thousand pre-aggregated rows. For example, if a data scientist issues large queries many times against the Cloud DW, there will be a substantial surprise bill at the end of the month. With pre-aggregation, you read once (to process the aggregations) and query a small amount of highly compacted data many times.
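A back-of-envelope comparison makes the read-once economics concrete. Every number below is an illustrative assumption, not real vendor pricing: a hypothetical per-rows-scanned rate, a guessed cube size, and a guessed monthly query count:

```python
# Back-of-envelope cost comparison. All numbers are illustrative
# assumptions, not actual pricing from any Cloud data warehouse.
raw_rows = 100_000_000_000            # 100 billion rows after the 10x jump
cube_cells = 500_000                  # assumed pre-aggregated cube size
cost_per_billion_rows_scanned = 0.25  # hypothetical $ per billion rows
queries_per_month = 2_000             # analysts + dashboard refreshes

# Option A: every query scans the raw fact table.
scan_cost = queries_per_month * (raw_rows / 1e9) * cost_per_billion_rows_scanned

# Option B: read the raw rows once to build the cube, then every
# query touches only the tiny pre-aggregated table.
build_cost = (raw_rows / 1e9) * cost_per_billion_rows_scanned
query_cost = queries_per_month * (cube_cells / 1e9) * cost_per_billion_rows_scanned
cube_cost = build_cost + query_cost

print(f"raw-scan: ${scan_cost:,.0f}/month  pre-aggregated: ${cube_cost:,.0f}/month")
```

Under these made-up numbers the gap is three orders of magnitude; the exact ratio will vary, but the structure of the saving – pay the scan once, amortize it over every query – holds regardless.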

Finally, as enterprises become more data-driven, many more employees will consume analytics data, beyond traditional analysts and data scientists. This means that user concurrency will increase as well. Of course, increased concurrency can be addressed through scaling out – but that is expensive. Or it can be addressed through the optimization of pre-aggregated cubes.

Kyvos can help do that with our next-gen modern OLAP capabilities. To know more about us, schedule a demo with our team today!
