
word distribution problem

Tags:

algorithm

I have a big file of words (~100 GB) and limited memory (4 GB). I need to compute the word distribution from this file. One option is to split the file into chunks, sort each chunk, and then merge the sorted chunks to compute the distribution. Is there a faster way? One idea is to sample, but I'm not sure how to implement it so that it returns something close to the correct answer.
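For reference, the chunk-sort-merge approach described above can be sketched as an external sort: count words per chunk in memory, spill each chunk as a sorted run to disk, then do a k-way merge of the runs. This is a minimal sketch, assuming whitespace-separated words and a tab-free vocabulary; `word_distribution` and `chunk_words` are illustrative names, not from any library.

```python
import heapq
import os
import tempfile
from collections import Counter

def word_distribution(path, chunk_words=1_000_000):
    """Count words in a file too large for memory: count each chunk in RAM,
    spill sorted (word, count) runs to disk, then merge the runs."""
    runs = []
    counter = Counter()

    def spill():
        # Write the in-memory counts as a sorted run on disk, then reset.
        if not counter:
            return
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        for word in sorted(counter):
            f.write(f"{word}\t{counter[word]}\n")
        f.close()
        runs.append(f.name)
        counter.clear()

    with open(path) as fh:
        n = 0
        for line in fh:
            for word in line.split():
                counter[word] += 1
                n += 1
                if n >= chunk_words:
                    spill()
                    n = 0
    spill()

    def run_iter(name):
        with open(name) as f:
            for line in f:
                w, c = line.rsplit("\t", 1)
                yield w, int(c)

    # k-way merge of the sorted runs, summing counts for equal words.
    result = {}
    for w, c in heapq.merge(*(run_iter(r) for r in runs)):
        result[w] = result.get(w, 0) + c
    for r in runs:
        os.remove(r)
    return result
```

Because each run is sorted, `heapq.merge` streams the runs in order, so equal words from different runs arrive adjacently and their counts can be summed without holding the full vocabulary of every run in memory at once.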

Thanks

asked Feb 21 '26 by user352951


1 Answer

You can build a trie where each node that marks the end of a word holds that word's running count (a word can end at a leaf or at an internal node, e.g. "the" inside "then"). Since many words share prefixes and the same words repeat, the trie's size is bounded by the vocabulary rather than the file, so 4 GB of memory should be enough to process 100 GB of data.
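The trie described above can be sketched as follows; this is a minimal illustration (class and method names are my own, not a standard library API), with one dict of children per node and a count on every node where a word ends:

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # how many times a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        # Walk (and create as needed) one node per character,
        # then bump the count at the final node.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def counts(self, node=None, prefix=""):
        # Yield (word, count) for every word stored in the trie.
        if node is None:
            node = self.root
        if node.count:
            yield prefix, node.count
        for ch, child in node.children.items():
            yield from self.counts(child, prefix + ch)

trie = Trie()
for w in ["the", "the", "then", "cat"]:
    trie.add(w)
dist = dict(trie.counts())
# dist == {"the": 2, "then": 1, "cat": 1}
```

In practice you would stream the 100 GB file word by word into `add`, then walk the trie once at the end to emit the distribution; memory grows with the number of distinct words and shared prefixes, not with the file size.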

answered Feb 24 '26 by Vitalii Fedorenko