Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count string num with limit memory?

Tags:

java

algorithm

The task is to count the num of words from a input file.

the input file is 8 chars per line, and there are 10M lines, for example:

aaaaaaaa  
bbbbbbbb  
aaaaaaaa  
abcabcab  
bbbbbbbb  
...

the output is:

aaaaaaaa 2  
abcabcab 1  
bbbbbbbb 2  
...

It'll takes 80MB memory if I load all of words into memory, but there are only 60MB in os system, which I can use for this task. So how can I solve this problem?

My algorithm is to use map<String,Integer>, but jvm throw Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I can solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.

like image 478
NOrder Avatar asked Apr 12 '12 09:04

NOrder


People also ask

How much memory does an integer string object take?

A char is 2 bytes, and an int is 4 bytes.

Can string store spaces in Java?

But these memory spaces are used for different purposes. Stack space contains specific values that are short-lived whereas Heap space used by Java Runtime to allocate memory to objects and JRE classes. In Java, strings are stored in the heap area.

What is the size of String object in Java?

Therefore, the maximum length of String in Java is 0 to 2147483647. So, we can have a String with the length of 2,147,483,647 characters, theoretically.


2 Answers

I believe that the most robust solution is to use the disk space.

For example you can sort your file in another file, using an algorithm for sorting large files (that use disk space), and then count the consecutive occurrences of the same word.

I believe that this post can help you. Or search by yourself something about external sorting.

Update 1

Or as @jordeu suggest you can use a Java embedded database library: like H2, JavaDB, or similars.

Update 2

I thought about another possible solution, using Prefix Tree. However I still prefer the first one, because I'm not an expert on them.

like image 89
dash1e Avatar answered Sep 19 '22 14:09

dash1e


Read one line at a time and then have e.g. a HashMap<String,Integer> where you put your words as key and the count as integer.

If a key exists, increase the count. Otherwise add the key to the map with a count of 1.

There is no need to keep the whole file in memory.

like image 26
Heiko Rupp Avatar answered Sep 20 '22 14:09

Heiko Rupp