
How do you count cardinality of very large datasets efficiently in Python?

I have been playing at work with some very large sets of data, typically several billion elements, that are all maintained in a memcached cloud and periodically dumped into files, and for one of my tasks I'm trying to count the cardinality of this set.

For some context, each item contains an IP and some other attributes identifying a person and is encoded in base64, item size is 20 bytes. Reducing the size of an item by removing some fields is not a possibility.

Here is something that emulates my dataset as an in-memory version (thanks to this post for string generation):

import base64, os

dataset_size = 10000000000 # that's 10 billion, be careful if you run it !
big_dataset = [base64.b64encode(os.urandom(10)) for i in range(dataset_size)]

My first approach was to use a hashset like this:

uniques = set(big_dataset)
print "Cardinality: %d" % len(uniques)

While this in theory works fine on a small dataset, as you can guess there is a hiccup:

  • I can't make any assumption on the uniqueness of my data. I could have 50% of my dataset that is unique, or I could have 100% just as well. This is generated dynamically at regular time intervals and varies depending on a lot of factors (time of day for example)
  • Dataset size is 10 billion. Each item encoded in base64 is 20 bytes, and 20 bytes times 10 billion is roughly 200 GB of raw data (see the quick sketch just after this list). Unfortunately, I don't have access to a machine with that much RAM!
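
To make the memory problem concrete, here is a quick back-of-envelope sketch (my own illustration, not part of the original question); the per-item Python object overhead makes the real cost considerably worse than the raw 200 GB:

import base64, os, sys

n_items = 10 * 10**9             # 10 billion items
raw_bytes = 20 * n_items         # 20 bytes of payload each -> ~200 GB of raw data

# In CPython each element of the set is a full string object, so the actual
# per-item cost is well above 20 bytes, before even counting the set's own
# hash-table overhead.
per_object = sys.getsizeof(base64.b64encode(os.urandom(13)))  # one 20-char encoded item
print "raw data: ~%d GB" % (raw_bytes / 10**9)
print "lower bound with object overhead: ~%d GB" % (per_object * n_items / 10**9)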

I've done my homework and found at best some research papers, or some obscure libraries, but part of the goal of this is to understand what approach works and why.

So I'm calling on you Python users: do you know of any algorithm that would help me estimate cardinality efficiently? By efficiently I mean I don't care that much about running time complexity; I'm more focused on space complexity. I don't mind sacrificing a bit of accuracy if it boosts performance tremendously (so I don't necessarily need to know the exact number of uniques, even if that would be ideal, but it's probably not a viable approach). I would say up to 5% error would be acceptable. I'm looking for something specifically in Python for this project.

Thanks for any help you can provide !

As some people noted, I could use Hadoop/MR, but for this specific project we don't want to go the MR way, and would like to explore algorithms to do this on a single machine efficiently, as this could be applied to a few other projects.

asked Apr 15 '12 by Charles Menguy



2 Answers

I would recommend the use of hash sketches, namely (Super)LogLog sketches or HyperLogLog sketches.

You can check and perhaps use and improve the simple python implementation that I made: https://github.com/goncalvesnelson/Log-Log-Sketch
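
To give a feel for how these sketches work, here is a minimal, untuned HyperLogLog written from scratch for illustration (it is not code from the linked repository): each item is hashed, a few bits of the hash select one of m registers, and each register keeps the maximum position of the first 1-bit it has seen. With 2^14 registers it uses only a few kilobytes of memory, and its typical error is about 1.04/sqrt(m), i.e. roughly 0.8%, well within your 5% budget.

import hashlib, math

class HyperLogLog(object):
    """Minimal HyperLogLog sketch, for illustration only."""

    def __init__(self, b=14):
        self.b = b                      # number of index bits
        self.m = 1 << b                 # number of registers (2^b)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)   # bias correction for m >= 128

    def add(self, item):
        # 64 bits taken from a SHA-1 hash of the item
        h = int(hashlib.sha1(item).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h & (self.m - 1)          # low b bits select a register
        w = h >> self.b                 # remaining 64 - b bits
        rank = (64 - self.b) - w.bit_length() + 1   # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        estimate = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            estimate = self.m * math.log(float(self.m) / zeros)
        return int(estimate)

You would then stream your items through it instead of building a set:

hll = HyperLogLog(b=14)
for item in big_dataset:    # or stream the items from your dump files
    hll.add(item)
print "Estimated cardinality: %d" % hll.cardinality()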

answered by goncalvesnelson


I would advise you to try a Bloom filter. Even with such an amount of data you can achieve extremely low error rates with modest memory requirements: you stream the items through the filter and count only those that were not already present (the false positive rate then tells you roughly how many true uniques you undercount). Using the (roughly) optimal number of hash functions k = ln(2) * m / n, where n is your 10 billion items, you can size the filter as m = -n * ln(p) / ln(2)^2 bits for a desired false positive rate p.

For example, for 10 billion items a 0.1% false positive rate works out to about 1.44e11 bits, i.e. roughly 17 GB of memory, and your 5% tolerance brings that down to around 7 GB, a fraction of the ~200 GB the raw data occupies. A very fast and extremely simple implementation of all this is http://mike.axiak.net/python-bloom-filter/docs/html/
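
As a quick illustration of those formulas (my own sketch, not code from that library), this is how the sizing works out in Python:

import math

def bloom_parameters(n_items, false_positive_rate):
    # m = -n * ln(p) / ln(2)^2 bits, k = ln(2) * m / n hash functions
    m_bits = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    k_hashes = (m_bits / n_items) * math.log(2)
    return int(math.ceil(m_bits)), int(round(k_hashes))

m, k = bloom_parameters(10 * 10**9, 0.001)
print "Bits: %d (~%.1f GB), hash functions: %d" % (m, m / 8.0 / 2**30, k)
# -> roughly 1.44e11 bits (~16.7 GB) with k = 10 hash functions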

answered by yahe