word distribution problem

Question

I have a large file of words (~100 GB) and only 4 GB of memory. I need to calculate the word distribution for this file. One option is to split it into chunks, sort each chunk, and then merge the sorted chunks while counting occurrences. Is there a faster way? One idea is to sample the file, but I'm not sure how to implement that so the result is close to the correct distribution.

Thanks

Vitalii Fedorenko · Accepted Answer

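You can build a trie in which every node that ends a word (each leaf, plus some internal nodes) stores the current count for that word. Because many words share prefixes, the shared parts are stored only once, so 4 GB should be enough to process 100 GB of data, provided the number of distinct words is much smaller than the total number of words.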
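To make the idea concrete, here is a minimal sketch in Python. It assumes the words are separated by whitespace and that the input can be read line by line. The names `TrieNode`, `add_word`, `count_words`, and `iter_counts` are made up for illustration. A trie built from Python dicts carries a lot of per-node overhead, so treat this as a demonstration of the structure rather than something tuned to fit a 4 GB budget.

```python
from typing import Dict, Iterable, Iterator, Tuple


class TrieNode:
    """One trie node; `count` is how many times the word ending here was seen."""
    __slots__ = ("children", "count")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0


def add_word(root: TrieNode, word: str) -> None:
    """Walk (and extend) the trie along `word`, then bump the final node's count."""
    node = root
    for ch in word:
        nxt = node.children.get(ch)
        if nxt is None:
            nxt = TrieNode()
            node.children[ch] = nxt
        node = nxt
    node.count += 1


def count_words(lines: Iterable[str]) -> TrieNode:
    """Stream over lines so that only the trie, not the input, is held in memory."""
    root = TrieNode()
    for line in lines:
        # Assumption: words are whitespace-separated.
        for word in line.split():
            add_word(root, word)
    return root


def iter_counts(root: TrieNode) -> Iterator[Tuple[str, int]]:
    """Yield (word, count) pairs using an explicit stack instead of recursion."""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            yield prefix, node.count
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
```

An open file object can be passed straight to `count_words`, since iterating over a file yields one line at a time. Dividing each count from `iter_counts` by the total number of words gives the distribution.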

Tags:

algorithm

user352951
