Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Probabilistic Counting in Python

I have a 50gb txt file of random strings , out of which I want to count the number of occurrences of a substring in that file.. many times, for different not predefined random substrings.

I was wondering if there is another way to approach the problem.

probabilistic way

Something like a bloom filter , but instead of probabilistic membership check, we could have probabilistic counting. That data structure would be used for count estimations.

Other statistical method(?)

Any dummy method that I could use to estimate the number of occurrences of a string in a text file ? Open to alternatives.

It would be nice if it could be done in <= logarithmic time as I will be doing the same task a lot of times.

like image 737
RetroCode Avatar asked Oct 29 '22 17:10

RetroCode


2 Answers

Some streaming algorithms sound relevant to this problem, either separately, or in conjunction with each other.

  1. An initial pass on the file could give an approximation of the heavy hitters. Depending on your problem, it's possible that the heavy hitters' distribution is sufficient for you, but that this set is small enough to hold in memory. If this is the case, you could perform a second pass, counting only the heavy hitters from the first pass.

  2. The count-min sketch data structure can perform approximate counting. You could either use this data structure on its own, or you could use it for counting the occurrences of the heavy hitters.

Since this is tagged as Python:

  • StreamLib

  • count-min-sketch libraries on PyPI

like image 134
Ami Tavory Avatar answered Nov 15 '22 05:11

Ami Tavory


You could compute a suffix array for your file.

This array contains the starting positions of suffixes in sorted order. With 50GB of text, you could allocate 5 bytes per position and end up with a suffix array of 5*50=250 GBytes. If this is too much, then you could try a compressed suffix array.

Computing this array can be done in O(n) (probably a few hours with an appropriate algorithm, mostly limited by disk read/write speed).

Once you have the array, you can count the number of occurrences of any substring in logarithmic time. In practice the time would be dominated by the seek times to different parts of your disk, so this part will be much faster if you store the files on a solid state drive.

like image 44
Peter de Rivaz Avatar answered Nov 15 '22 07:11

Peter de Rivaz