Efficient way to store millions of arrays, and perform IN check

There are around 3 million arrays - or Python lists/tuples (it doesn't really matter). Each array consists of elements like the following:

['string1', 'string2', 'string3', ...]  # 10000 elements in total

These arrays should be stored in some kind of key-value storage. For simplicity of explanation, let's assume for now that it's a Python dict.

So, 3 million keys, where each key maps to a 10000-element array.

Lists/tuples or any other custom container - it doesn't really matter. What matters is that the arrays consist of strings - UTF-8 or Unicode strings, from 5 to about 50 chars each. There are about 3 million possible strings as well. It is possible to replace them with integers if it's really needed, but for more efficient further operations I would prefer to keep strings.

Though it's hard to give you a full description of the data (it's complicated and odd), it's something similar to synonyms - let's assume we have 3 million words - as the dict keys - and 10k synonyms for each word - as the elements of its list.

Like this (not real synonyms, but it will give you the idea):

{
    'computer': ['pc', 'mac', 'laptop', ...],  # (10k in total)
    'house': ['building', 'hut', 'inn', ...],  # (another 10k)
     ...
}

Elements - 'synonyms' - can be sorted if it's needed.

Later, after the arrays are populated, there's a loop: we go through all the keys and check whether some variable is in their values. For example, a user inputs the words 'computer' and 'laptop' - and we must quickly reply whether the word 'laptop' is a synonym of the word 'computer'. The issue here is that we have to do this check millions of times, probably 20 million or so. Just imagine we have a lot of users entering random pairs of words - 'computer' and 'car', 'phone' and 'building', and so on. They may 'match', or they may not.
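Here is roughly what the naive exact version looks like (is_synonym is just an illustrative name), using Python sets for O(1) membership checks - the problem is that plain in-memory sets at 3 million keys × 10k strings would take far more than 30GB:

synonyms = {
    'computer': {'pc', 'mac', 'laptop'},  # imagine ~10k entries per key
    'house': {'building', 'hut', 'inn'},
    # ... ~3 million keys
}

def is_synonym(word, candidate):
    # Exact check: is `candidate` among the stored elements for `word`?
    return candidate in synonyms.get(word, ())

print(is_synonym('computer', 'laptop'))  # True
print(is_synonym('computer', 'car'))     # False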

So, in short - what I need is to:

  • store these data structures memory-efficiently,
  • be able to quickly check whether some item is in an array.

I should be able to keep memory usage below 30GB. Also I should be able to perform all the iterations in less than 10 hours on a Xeon CPU.

It's OK to have around 0.1% false answers - both false positives and false negatives - though it would be better to reduce them or avoid them entirely.

What is the best approach here? Algorithms, links to code, anything is really appreciated. Also, a friend of mine suggested using Bloom filters or MARISA tries here - is he right? I haven't worked with either of them.

asked Feb 18 '14 by Spaceman


1 Answer

I would map each unique string to a numeric ID, then associate a Bloom filter with each key, using around 20 bits per element for your <0.1% error rate. 20 bits × 10000 elements × 3 million keys is 75GB, so if you are space-limited, store a smaller, less accurate filter in memory and a more accurate filter on disk, which is only consulted if the first filter says it might be a match.
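As a rough sketch of the idea - a hand-rolled Bloom filter per key, using only the standard library. The BloomFilter class and its sizing here are illustrative rather than taken from any particular package, it hashes the strings directly instead of going through numeric IDs (which doesn't change the idea), and in practice a C-backed Bloom filter library would be much faster:

import hashlib
import math


class BloomFilter:
    def __init__(self, n_items, error_rate=0.001):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions.
        self.m = max(1, int(-n_items * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, int(round(self.m / n_items * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(item.encode('utf-8')).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# One filter per key; each is sized for its ~10k elements.
filters = {'computer': BloomFilter(n_items=10000)}
for synonym in ('pc', 'mac', 'laptop'):
    filters['computer'].add(synonym)

print('laptop' in filters['computer'])  # True
print('car' in filters['computer'])     # False (small false-positive chance)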

There are alternatives, but they will only reduce the size from 1.44·n·log2(1/ε) bits to n·log2(1/ε) bits per key. In your case ε = 0.001, so the theoretical limit is a data structure of 99658 bits per key, or about 10 bits per element, which across 3 million keys is 298,974,000,000 bits, or roughly 38 GB.
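For reference, here is the arithmetic behind those figures, using the standard n·log2(1/ε) lower bound for approximate membership and the 1.44 factor for a classic Bloom filter:

import math

eps = 0.001        # allowed error rate
n = 10000          # elements per key
keys = 3_000_000

bits_per_element = math.log2(1 / eps)              # ≈ 9.97 bits, the theoretical minimum
bloom_bits_per_element = 1.44 * bits_per_element   # ≈ 14.4 bits for a classic Bloom filter

print(round(bits_per_element * n))                 # ≈ 99658 bits per key
print(bits_per_element * n * keys / 8 / 1e9)       # ≈ 37.4 GB at the theoretical limit (~38 GB)
print(bloom_bits_per_element * n * keys / 8 / 1e9) # ≈ 53.8 GB for a Bloom filter sized exactly for 0.1%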

So your 30GB budget is below the theoretical limit for a data structure with the error rate and number of entries that you require, but it is within the ballpark.

answered Oct 13 '22 by Pete Kirkham