Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Large number of Object in Java (with HashMap)

Hello,

I'm currently working on a word prediction in Java. For this, I'm using a NGram based model, but I have some memory issues...

In a first time, I had a model like this :

public class NGram implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient int count;
    private int id;
    private NGram next;

    public NGram(int idP) {
        this.id = idP;
    }
}

But it's takes a lot of memory, so I thought I need optimization, and I thought, if I have "hello the world" and "hello the people", instead of get two ngram, I could keep in one that keep "Hello the" and then have two possibilty : "people" and "world".

To be more clear, this is my new model :

public class BNGram implements Serializable {
    private static final long serialVersionUID = 1L;
    private int id;
    private HashMap<Integer,BNGram> next;
    private int count = 1;

    public BNGram(int idP) {
        this.id = idP;
        this.next = new HashMap<Integer, BNGram>();
    }
}

But it seems that my second model consume twice more memory... I think it's because of HashMap, but I don't how to reduce this? I tried to use different Map implementations like Trove or others, but it don't change any thing.

To give you a idea, for a text of 9MB with 57818 different word (different, but it's not the total number of word), after NGram generation, my javaw process consume 1.2GB of memory... If I save it with GZIPOutputStream, it takes arround 18MB on the disk.

So my question is : how can I do to use less memory ? Can I make something with compression (as the Serialization). I need to add this to a other application, so I need to reduce the memory usage before...

Thanks a lot, and sorry for my bad english...

ZiMath

like image 312
Mathieu THEBAUD Avatar asked Mar 21 '15 12:03

Mathieu THEBAUD


2 Answers

You need a specialized structure to achieve what you want.

Take a look at Apache's PatriciaTrie. It's like a Map, but it's memory-wise and works with Strings. It's also extremely fast: operations are O(k), with k being the number of bits of the largest key.

It has an operation that suits your immediate needs: prefixMap(), which returns a SortedMap view of the trie that contains Strings which are prefixed by the given key.

A brief usage example:

public class Patricia {

    public static void main(String[] args) {

        PatriciaTrie<String> trie = new PatriciaTrie<>();

        String world = "hello the world";
        String people = "hello the people";

        trie.put(world, null);
        trie.put(people, null);

        SortedMap<String, String> map1 = trie.prefixMap("hello");
        System.out.println(map1.keySet());  // [hello the people, hello the world]

        SortedMap<String, String> map2 = trie.prefixMap("hello the w");
        System.out.println(map2.keySet()); // [hello the world]

        SortedMap<String, String> map3 = trie.prefixMap("hello the p");
        System.out.println(map3.keySet());  // [hello the people]
    }
}

There are also the tests, which contain more examples.

like image 99
fps Avatar answered Oct 01 '22 20:10

fps


Here, I'm primarily trying to explain why you are observing such an excessive memory consumption, and what you could do about this (if you wanted to stick to the HashMap) :

A HashMap that is created with the default constructor will have an initial capacity of 16. This means that it will have space for 16 entries, even if it is empty. Additionally, you seem to create the map, regardless of whether it is needed or not.

So way to reduce the memory consumption in your case would be to

  • Create the map only when it is necessary
  • Create it with a smaller initial capacity

Applied to your class, this could roughly look like this:

public class BNGram {
    private int id;
    private Map<Integer,BNGram> next;

    public BNGram(int idP) {
        this.id = idP;
        // (Do not create a new `Map` here!)
    }

    void doSomethingWhereTheMapIsNeeded(Integer key, BNGram value) {

        // Create a map, when required, with an initial capacity of 1
        if (next == null) {
            next = new HashMap<Integer, BNGram>(1);
        }
        next.put(key, value);
    }
}

But...

... conceptually, it is questionable to have a large "tree" structure consisting of many, many maps, each only with "few" entries. This suggests that a different data structure is more appropriate here. So you should definitely prefer a solution like the one in the answer by Magnamag, or (if this is not applicable for you, as suggested in your comments), look out for an alternative data structure - maybe even by formulating this as a new question that does not suffer from the XY Problem.

like image 30
Marco13 Avatar answered Oct 01 '22 21:10

Marco13