How Big Data can help solve the Distinct Count problem for Retail Businesses

By February 14, 2019 April 29th, 2019 Uncategorized

Retail industry businesses, especially the ones involved in online and other non-store sales, are experiencing significant growth. According to the National Retail Federation, they are expected to grow between 10 to 12% every year. Given this expected growth, the industry will continue to evolve, and retailers will face various challenges mainly due to

  • Changing customer preferences and expectations
  • A highly competitive marketplace

It becomes imperative that retailers keep a close eye on all the factors that impact their business, both in positive and negative ways and continuously measure and analyze them over time to take timely actions. These insights are critical for them to run their businesses successfully and maintain long-term sustainability.

For example, one of our clients, a multinational computer software company selling its own software products wanted to measure its complete customer journey, right from the point of inception where an unknown visitor comes and visits the website, views product, signs up for demos or trials, fills subscription forms, makes payment, all the way down to click-level product usage data. This would help them answer questions like

  • Which features are more relevant and valuable to the customers?
  • Which workflows or tasks correlate more to conversion and retention?
  • How marketing campaigns affect customer journeys?
  • How many unique customers are there in each stage of the business funnel?
  • How many distinct visits are there from each campaign?

These customer journey conversion and retention event metrics can be analyzed by different business units like Marketing, Finance, Sales, Customer Care, etc. to gain valuable insights and help them focus efforts in the right direction to increase customer retention.

To measure these, they would need to count all the events, i.e., distinct visits or cookies on their website from different customers and analyze them over time. A YoY, QoQ, and WoW comparison of these metrics across different dimensions between current and past years would help them understand how their business is performing. Year-over-year(YoY) is a comparison of a statistic for one period to the same period in the previous year. Similarly, one can do Quarter-over-Quarter(QoQ) and Week-over-Week(WoW) comparisons.

Now here comes the real problem. For large enterprises, counting these distinct events can scale up to millions and billions of cardinality, and across terabytes and petabytes of data, and not only counting them but also performing a slice and dice of these numbers across different dimensions can be a big challenge if we try to achieve this using traditional big data technologies like Hive and Impala.

Challenges in performing accurate distinct count using Hive and Impala

In Hive, COUNT (distinct) is a single reducer problem and goes through a massive reduce side sort. The query executes using multiple Mappers and one Reduce stage. Map sends each value to the single reducer, and reducer does all the job. One reducer processing too much data may cause a data skew. Therefore, the “reduce” method becomes an expensive and all in-memory operation. So, if you are operating on several TBs of data – it might take an excessively long time.

Though by default Impala only allows a single COUNT(DISTINCT columns) expression in each query, it does includes a workaround for this limitation like using CROSS JOINS but that’s a costly operation.

Though if we do not need exact accuracy and approximate accuracy is acceptable, then we can produce an estimate of the distinct values for a column in both Hive and Impala.

In Impala, we can specify NDV(column). NDV is an aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the “number of distinct values.” It is much faster than the combination of COUNT and DISTINCT and uses a constant amount of memory. Thus, it is less memory-intensive for columns with high cardinality.
To automatically rewrite COUNT(DISTINCT) expressions to NDV(), we need to enable the APPX_COUNT_DISTINCT query option.

While in hive APPX_COUNT_DISTINCT Function implements the HyperLogLog algorithm for approximating the number of distinct elements in a multiset. The HyperLogLog algorithm can estimate cardinalities greater than 1 billion with a typical accuracy of 2%, using only 1.5 kB of memory. We will soon discuss the Hyperlog algorithm in detail.

Accurate Vs. Approximate Distinct count

The major pain areas in calculating the distinct counts are:

  • Overall execution time increases as the size of the data increases
  • Calculating distinct counts requires very high memory to store distinct values, and even if we apply hashing with compression to reduce memory needs, this would still cause an increase in overall execution time.

To better understand this problem, let’s take an example where we want to perform distinct counts on Customer Id where each id can be represented by 16 characters and requires 128 bits of storage. If we are receiving 300 million events per day, it would require 38400000000 bits to save the IDs. Suppose if we use an in-memory HashSet to find the distinct values, it would require a similar amount of RAM plus Java overhead to store objects in memory. The problem would get worse when we want to perform the distinct count on days, months, and years of data. So, we definitely need a better solution for finding the distinct counts.

There are many cardinality estimation algorithms which trade space for accuracy like Hyper LogLog algorithm. Hyper LogLog algorithm approach is used to find an approximate distinct count. It allows the user to specify the desired accuracy by defining the relative standard deviation and the max cardinality you expect to count. This counter can count one billion distinct items with an accuracy of 2% using only 1.5 kilobytes of space, and the efficiency of this algorithm becomes obvious.

Likewise, we have other algorithms that can help us estimate both approximate and accurate distinct count measures. Accurate distinct count algorithms would require more storage space as compared to approximate.

In the Big Data world, since the input dataset spreads over multiple machines, merging the contents of distributed counters created by using the algorithms mentioned above is needed to find the final cardinality. These algorithms already handle collisions, so we can still get a cardinality estimation with the desired precision even though we never brought all of the input data to a single machine. This is very useful and saves us a lot of time and effort moving data around the network.

Why an approximate distinct count solution is not always helpful

As we have seen, approximate distinct count algorithms require comparatively very less memory and space as compared to accurate distinct counts and can provide results which are varying by just 2%. But why can’t this be a solution always and why can’t it replace accurate distinct counts?

To understand this let’s come back to our client’s example who wants to measure customer journey and count distinct website visits. As per the following data for distinct visits:

YearApproximate Distinct countActual Distinct count
2017500m (+2%)490m
2018500m (-2%)510m

There is 0% YoY growth between 2017 and 2018 if we go by the approximate distinct count numbers, but in reality, they got 20 million more visits, giving us a completely incorrect picture if we use approximate distinct count.

So, it becomes imperative to decide which metrics are mandatory to be kept as accurate and which ones we can live with approximate distinct counts since every accurate distinct count comes with an expense of more storage space and more execution time.

How does Kyvos help solve this problem?

Kyvos provides a solution for both Accurate and Approximate distinct counts.

We have already successfully solved this problem for many large-scale enterprises who were dealing with data of millions of distinct count cardinalities. With Kyvos, they can slice and dice these values interactively against n number of dimensions using BI tools, with a response time ranging between milliseconds to a few seconds depending upon the cardinality. This is much faster than Hive and Impala.

Many big enterprises have created multidimensional cubes using Kyvos, containing a combination of accurate and approximate distinct count measures and can successfully derive insights on a day-to-day basis. Incremental refresh of cube helps them keep the data up-to-date, and update the distinct count counters as and when new data gets added.

Though the distinct count cardinalities can scale from millions to billions, enterprises need to make a conscious decision when they are deciding between approximate and accurate distinct counts. More the accurate distinct counts and the cardinality, bigger would be the cube size and more time or resources would be needed to fetch the results.


Leave a Reply

fourteen + eighteen =