I have been playing at work with some very large sets of data, typically several billion elements, that are all maintained in a memcached cloud and periodically dumped into files, and for one of my tasks I'm trying to count the cardinality of this set.
For some context: each item contains an IP address and some other attributes identifying a person, and is encoded in base64 at 20 bytes per item. Reducing the size of an item by removing some fields is not an option.
Here is something that emulates my dataset as an in-memory version (thanks to this post for string generation):
import base64, os
dataset_size = 10000000000 # that's 10 billion, be careful if you run it!
big_dataset = [base64.b64encode(os.urandom(10)) for i in xrange(dataset_size)] # xrange avoids building a second 10-billion-element list just for the loop
My first approach was to use a hashset like this:
uniques = set(big_dataset)
print "Cardinality: %d" % len(uniques)
While this in theory works fine on a small dataset, as you can guess there is a hiccup: a set of 10 billion short strings does not come close to fitting in memory, since with CPython's per-object overhead it would need on the order of hundreds of gigabytes of RAM on a single machine.
I've done my homework and found, at best, some research papers and some obscure libraries, but part of the goal here is to understand which approach works and why.
So I'm calling on you Python users: do you know of any algorithm that would help me estimate cardinality efficiently? By efficiency I mean space complexity much more than running time. I don't mind sacrificing a bit of accuracy if it boosts performance tremendously, so I don't necessarily need the exact number of uniques (that would be ideal, but it's probably not a viable approach); I would say an error of up to 5% is acceptable. I'm looking for something specifically in Python for this project.
Thanks for any help you can provide!
As some people noted, I could use Hadoop/MR, but for this specific project we don't want to go the MR route; we would like to explore algorithms that do this efficiently on a single machine, as the approach could be applied to a few other projects.
I would recommend using hash sketches, namely (Super)LogLog sketches or HyperLogLog sketches.
You can check out, and perhaps use and improve, the simple Python implementation that I made: https://github.com/goncalvesnelson/Log-Log-Sketch
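To make the LogLog idea concrete, here is a minimal, self-contained HyperLogLog-style sketch. This is my own illustrative code, not the library linked above; the register count 2**p, the md5-based 64-bit hash and the bias constant are assumptions you would want to tune and verify before trusting the numbers.

import hashlib
import math

class HyperLogLog(object):
    """Minimal HyperLogLog sketch: 2**p registers, each storing the largest
    'rank' (position of the first 1-bit) seen among the hashes routed to it."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                               # 2**14 registers -> a few KB, roughly 1% error
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)  # bias-correction constant (valid for m >= 128)

    def add(self, item):
        # item is a byte string, e.g. one of the base64-encoded records
        x = int(hashlib.md5(item).hexdigest(), 16) & ((1 << 64) - 1)
        j = x & (self.m - 1)                          # low p bits select a register
        w = x >> self.p                               # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1     # position of the first 1-bit
        if rank > self.registers[j]:
            self.registers[j] = rank

    def cardinality(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z   # raw HyperLogLog estimate
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:        # small-range correction (linear counting)
            estimate = self.m * math.log(float(self.m) / zeros)
        return int(estimate)

hll = HyperLogLog(p=14)
for item in big_dataset:                              # or stream the dump files instead
    hll.add(item)
print "Estimated cardinality: %d" % hll.cardinality()

The appeal for your case is the space complexity: the whole state is 2**p small integers (kilobytes, not gigabytes) no matter how many billions of items you feed it, and the relative error is about 1.04 / sqrt(2**p).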
I would advise you to try a Bloom filter. Even with this amount of data you can achieve extremely low error rates with modest system requirements. Using the (roughly) optimal number of hash functions k = ln(2) * m / n, you can size the filter as m = -n * ln(p) / (ln 2)^2 bits, where n is the number of elements (10 billion here) and p is the desired false-positive rate.
For example, a 0.1% false-positive rate works out to about 14.4 bits per element, i.e. roughly 18 GB for 10 billion elements (or under 2 GB if you are closer to 1 billion). A very fast and extremely simple implementation of all this is http://mike.axiak.net/python-bloom-filter/docs/html/
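For illustration, here is a minimal pure-Python sketch of how you could count distinct items with a Bloom filter sized by the formulas above. It is my own code, not the API of the linked library; the sha1 double hashing and the bytearray bit storage are assumptions chosen to keep it self-contained. An item is counted as new only if adding it flips at least one bit that was still unset.

import hashlib
import math
import struct

class BloomDistinctCounter(object):
    """Minimal Bloom-filter-based distinct counter (illustrative sketch).
    Sized with m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln(2) hash functions."""

    def __init__(self, expected_items, false_positive_rate):
        self.m = int(math.ceil(-expected_items * math.log(false_positive_rate)
                               / (math.log(2) ** 2)))              # filter size in bits
        self.k = max(1, int(round(float(self.m) / expected_items * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)                   # zeroed bit array (~18 GB for n = 10 billion, p = 0.1%)
        self.distinct = 0

    def _positions(self, item):
        # k bit positions via double hashing on two 64-bit halves of a sha1 digest
        h1, h2 = struct.unpack('>QQ', hashlib.sha1(item).digest()[:16])
        h2 = h2 or 1                                               # avoid a degenerate second hash
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        # item is a byte string; it counts as new only if some of its bits were unset
        new = False
        for pos in self._positions(item):
            byte_index, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte_index] & mask:
                self.bits[byte_index] |= mask
                new = True
        if new:
            self.distinct += 1

counter = BloomDistinctCounter(expected_items=10**10, false_positive_rate=0.001)
for item in big_dataset:                                           # or stream the dump files
    counter.add(item)
print "Approximate cardinality: %d" % counter.distinct

Two things to keep in mind: the estimate is biased slightly low, because a false positive makes a genuinely new item look already seen (roughly the false-positive rate at worst, so well inside a 5% budget), and the memory grows with the number of elements, whereas a LogLog-style sketch stays in the kilobyte range. The Bloom filter does, however, also let you answer membership queries afterwards.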