Here is the real-world issue we are solving. We have some rather large data sets that need to be aggregated and summarized in real time, with a number of filters and formulas applied to them. Applying these to each record in real time works fine when the data set is under 50,000 records, but as we approach 100,000 and then 100+ million records, the overhead of doing real-time math against every record gets too great. We have spent a lot of time on SQL optimization, and even tried holding entire datasets in RAM, and we keep coming to the same conclusion: we need to “zoom out” from the data and summarize groups. We need a way to group like records together and then apply the math once to each group of “like records”. This clumping of records is what lets us do fast, real-time reporting. Our current solution groups record sets that are exactly the same.
Here is an example record
ID77968 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,
So if we have 2 of these with exactly the same data
ID77968 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,
ID77969 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,
then we create a group. We can then apply the math and logic once to that group and multiply the outcome by 2 to get the real answer. This works really well for us and has been a big help in getting around the scale issues. That said, we now have a new problem. Some of our values have much larger ranges of outcomes, which creates data sets of thousands of records where only a couple are exactly the same. After some brainstorming we came up with the idea of applying some “fuzzy” logic to group together records that are merely similar. The issue is that we don’t know the best, statistically sound way to reduce the record set into groups that aren’t exactly the same.
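To make the current approach concrete, here is a minimal sketch of the idea (the columns and the summarize() formula are made up for illustration, not our actual logic): exact duplicates collapse into one group with a count, the math runs once per group, and the result is scaled by the count.

    from collections import Counter

    # Hypothetical records: each record is a tuple of column values.
    records = [
        (1, "43:19.7", True, 3, 0, 4),
        (1, "43:19.7", True, 3, 0, 4),   # exact duplicate of the first record
        (2, "51:02.1", False, 1, 7, 0),
    ]

    # Collapse exact duplicates into (record, count) groups.
    groups = Counter(records)

    def summarize(record):
        # Stand-in for the real filters/formulas; here we just sum the numeric columns.
        return sum(v for v in record if isinstance(v, (int, float)) and not isinstance(v, bool))

    # Run the math once per group and scale by the group size.
    total = sum(summarize(rec) * count for rec, count in groups.items())
    print(total)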
Here is what we need to do. (Simple example, single column)
Suppose we have the following 20 numbers:
106
0
8
0
1
0
4
0
3474
0
204
0
75
0
128
0
617
0
20
0
In the above set we have a lot of 0’s, so those are easy to group together. But how do I form, let’s say, 3 more groups? On the outer bound we have 3474, but given how the values below it are weighted, the outer group might be centered around something like 2000, and then 3474 and 617 would be combined into a single group. In our team meeting we thought of this as a gravity problem, or what you might call “cheerio attraction”. Ideally we would find an equation or approach that lets us look at the entire record set and say: express this in X number of groups. That would let us vary how coarsely the data is grouped/clumped; using the example 20 numbers above, we could express them in 15 groups or in 8 groups, whichever we choose. Keep in mind that the example above is just a single column, whereas I am trying to group entire records like
ID77968 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,
ID77969 1, 43:19.4, 43:19.7, TRUE, 1.2, 3.2, 0, 3, 2, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 179,
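For concreteness, here is roughly the kind of grouping we are imagining on the single-column example, sketched with a plain 1-D k-means (Lloyd's algorithm). The choice of k = 4 and the initialization are only for illustration; the point is that k is the knob that controls how many groups the data gets expressed in.

    import random

    values = [106, 0, 8, 0, 1, 0, 4, 0, 3474, 0, 204, 0, 75, 0, 128, 0, 617, 0, 20, 0]

    def kmeans_1d(data, k, iters=100, seed=0):
        random.seed(seed)
        # Start from k distinct values drawn from the data.
        centers = sorted(random.sample(sorted(set(data)), k))
        for _ in range(iters):
            # Assign each value to its nearest center.
            clusters = [[] for _ in range(k)]
            for x in data:
                idx = min(range(k), key=lambda i: abs(x - centers[i]))
                clusters[idx].append(x)
            # Move each center to the mean of its cluster.
            new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
            if new_centers == centers:
                break
            centers = new_centers
        return centers, clusters

    centers, clusters = kmeans_1d(values, k=4)
    for c, members in zip(centers, clusters):
        print(f"center={c:.1f} count={len(members)} members={members}")

Changing k varies how coarsely the values are clumped, which is the "express this in X groups" behavior described above.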
Thanks in advance for the help.
We currently hash each record as it comes in, and if a record has the same hash we add it to that group. The issue with hashing is that if a record isn’t exactly the same, it won’t be grouped. This has worked for us for a while because the values in each column were relatively bounded. We have now introduced some values with far greater ranges, which has rendered the hash grouping ineffective. Before, we were able to take 100MM records and hash them down to just over 100K groups, but now we are seeing new data sets of just 70K records where all 70K are unique. Deidentified data here: Copy of rv.zip (3,58 MB)
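For reference, the current hashing step looks roughly like this (a simplified sketch, not the production code):

    import hashlib
    from collections import defaultdict

    def record_hash(record):
        # Hash the raw column values; any difference in any column changes the hash.
        return hashlib.md5("|".join(str(v) for v in record).encode("utf-8")).hexdigest()

    groups = defaultdict(list)
    for record in [
        (1, "43:19.7", "43:19.7", True, 1, 3, 0, 4),
        (1, "43:19.7", "43:19.7", True, 1, 3, 0, 4),     # same hash -> same group
        (1, "43:19.4", "43:19.7", True, 1.2, 3.2, 0, 3), # any difference -> new group
    ]:
        groups[record_hash(record)].append(record)

    print(len(groups))  # 2 groups for 3 records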
The three common measures that summarize the center of a distribution are the mean (the arithmetic average), the median (the value in the middle of the sorted data), and the mode (the most frequently repeated value). Summary statistics are a part of descriptive statistics that condense the sample data down to its gist; statisticians commonly start by describing the observations with a measure of location, or central tendency, such as the arithmetic mean.
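For example, on the 20 numbers from the question (using Python's statistics module):

    import statistics

    values = [106, 0, 8, 0, 1, 0, 4, 0, 3474, 0, 204, 0, 75, 0, 128, 0, 617, 0, 20, 0]

    print(statistics.mean(values))    # arithmetic average, pulled up by the 3474 outlier
    print(statistics.median(values))  # middle value (0.5 here, since half the values are 0)
    print(statistics.mode(values))    # most frequent value: 0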
I agree that we need more information to give you the best advice.
If you need real-time analysis of large amounts of incoming data, you may want to look at streaming algorithms in general. Many streaming algorithms do not actually attempt to store all of the incoming data, but rather use each new datapoint to update a live model that you care about. Such algorithms are very useful in cases such as network traffic analysis where high-level trends are more important than each individual packet.
One example would be online k-means, which does not store each new data point and re-run k-means on the entire dataset, but instead updates the current set of means using the new data point and then discards it.
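A minimal sketch of that update step, assuming numeric record vectors and a per-centroid 1/n learning rate (one common choice among several):

    import math

    class OnlineKMeans:
        def __init__(self, initial_centroids):
            # Seed with a few early records or rough guesses; these are illustrative, not tuned values.
            self.centroids = [list(c) for c in initial_centroids]
            self.counts = [1] * len(self.centroids)

        def update(self, point):
            # Find the nearest centroid ...
            distances = [math.dist(point, c) for c in self.centroids]
            i = distances.index(min(distances))
            # ... nudge it toward the new point, then discard the point itself.
            self.counts[i] += 1
            eta = 1.0 / self.counts[i]
            self.centroids[i] = [c + eta * (p - c) for c, p in zip(self.centroids[i], point)]
            return i  # group id assigned to this record

    model = OnlineKMeans([(0, 0), (100, 5), (1000, 50)])
    for record in [(3, 0), (96, 4), (3474, 79), (617, 14)]:
        print(record, "-> group", model.update(record))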
Another thing to consider would be some type of lossy compression of related records like vector quantization (VQ) (there are many different flavors of VQ, both supervised (LVQ) and unsupervised). This would allow you to replace "similar" records with a prototype record representative of the group.
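And a sketch of the unsupervised flavor, using scikit-learn's KMeans to build the codebook, assuming the records have already been converted to numeric vectors (and, in practice, scaled so one wide-ranged column does not dominate the distances); the number of prototypes is the same "express this in X groups" knob:

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy numeric records; in practice these would be your parsed and scaled columns.
    records = np.array([
        [1, 3, 0, 4, 79],
        [1.2, 3.2, 0, 3, 179],
        [0, 1, 0, 0, 14],
        [0, 1, 0, 1, 14],
    ])

    n_prototypes = 2  # how many groups to "zoom out" to
    vq = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit(records)

    # Each record is replaced by its prototype; per-group math runs on the prototype,
    # weighted by how many records landed in that group.
    prototypes = vq.cluster_centers_
    counts = np.bincount(vq.labels_, minlength=n_prototypes)
    print(prototypes)
    print(counts)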