The task is to count the num of words from a input file.
the input file is 8 chars per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
the output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It'll takes 80MB memory if I load all of words into memory, but there are only 60MB in os system, which I can use for this task. So how can I solve this problem?
My algorithm is to use map<String,Integer>
, but jvm throw Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I can solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.
A char is 2 bytes, and an int is 4 bytes.
But these memory spaces are used for different purposes. Stack space contains specific values that are short-lived whereas Heap space used by Java Runtime to allocate memory to objects and JRE classes. In Java, strings are stored in the heap area.
Therefore, the maximum length of String in Java is 0 to 2147483647. So, we can have a String with the length of 2,147,483,647 characters, theoretically.
I believe that the most robust solution is to use the disk space.
For example you can sort your file in another file, using an algorithm for sorting large files (that use disk space), and then count the consecutive occurrences of the same word.
I believe that this post can help you. Or search by yourself something about external sorting.
Update 1
Or as @jordeu suggest you can use a Java embedded database library: like H2, JavaDB, or similars.
Update 2
I thought about another possible solution, using Prefix Tree. However I still prefer the first one, because I'm not an expert on them.
Read one line at a time
and then have e.g. a HashMap<String,Integer>
where you put your words as key and the count as integer.
If a key exists, increase the count. Otherwise add the key to the map with a count of 1.
There is no need to keep the whole file in memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With