 

Java Project: Make HashMap (including Load-Store) Performance Better

I am writing code for our server that has to determine a user's access type from a URL.

At the beginning, we saw 100 million distinct URLs accessed per day. Over time, this has grown to nearly 600 million distinct URLs per day.

For 100 million, what we did is the following:

1) Build a HashMap backed by parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part (represented as an int) - a key can have multiple values.

2) Then search the HashMap to find how many times a URL was accessed.
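The parallel-array layout described above might look roughly like the following sketch. This is a hypothetical single-value illustration (the question's actual map, linked below, supports multiple values per key); it assumes open addressing with linear probing, a power-of-two capacity, and that 0 is never a real key.

```java
// Minimal open-addressing hash table over parallel arrays:
// long keys (one URL part) mapped to int values (the other part).
// Sketch only: assumes the table is never filled completely.
class LongIntMap {
    private static final long EMPTY = 0L; // assumption: 0 is never a valid key
    private final long[] keys;
    private final int[] values;
    private final int mask;

    LongIntMap(int capacityPow2) {        // capacity must be a power of two
        keys = new long[capacityPow2];
        values = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int index(long key) {
        int h = (int) (key ^ (key >>> 32)); // fold the long into an int hash
        return h & mask;
    }

    void put(long key, int value) {
        int i = index(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;             // linear probing
        }
        keys[i] = key;
        values[i] = value;
    }

    int get(long key, int missing) {
        int i = index(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return missing;                     // key not present
    }
}
```

Parallel primitive arrays avoid the per-entry object overhead and boxing of `java.util.HashMap<Long, Integer>`, which matters at hundreds of millions of entries.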

Now, as the HashTable has become larger, what we do is the following:

1) Build two or three separate HashTables, and load and store them (on the general file system) to find how many times a URL was accessed.

Now, the issue is:

1) Though the HashTable performance is quite good, the code takes more time loading/storing the HashTable (we are using FileChannel; it takes 16-19 seconds to load/store a HashTable with 200 million entries at a load factor of 0.5).
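The bulk load/store pattern described above might be sketched as follows. This is a hypothetical illustration of dumping a table's backing array through a FileChannel in one pass; the file name is an assumption.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: store and load a table's backing long[] in one bulk pass
// through a FileChannel, using a direct buffer in native byte order.
public class BulkIO {
    static void store(Path file, long[] keys) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(keys.length * 8)
                                   .order(ByteOrder.nativeOrder());
        buf.asLongBuffer().put(keys); // fill via a long view; buf position stays 0
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(buf);            // single bulk write of the whole array
        }
    }

    static long[] load(Path file, int n) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(n * 8)
                                   .order(ByteOrder.nativeOrder());
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (buf.hasRemaining() && ch.read(buf) != -1) { /* keep filling */ }
        }
        buf.flip();
        long[] keys = new long[n];
        buf.asLongBuffer().get(keys); // bulk copy back into the heap array
        return keys;
    }
}
```

Even done this way, the cost of copying 200 million entries between heap arrays and disk on every load/store is unavoidable, which is what motivates the memory-mapped approach in the answer below.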

What we are trying to ask is:

1) Any comments on how to solve this issue?

2) How can we reduce the load/store time? (I asked before, but it seems FileChannel is the best way.)

3) Would storing a large HashTable (larger than memory) and caching it repeatedly be a good solution? If so, how should we do it (at least some pointers)? We tried it using:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

RandomAccessFile raf = new RandomAccessFile("array.dat", "rw");
IntBuffer map = raf.getChannel()
                   .map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30)
                   .order(ByteOrder.nativeOrder())
                   .asIntBuffer();

However, this gives worse performance than before.

Thanks.

NB:

1) As per previous Stack Overflow suggestions, we tried a NoSQL DB (TokyoCabinet), but in our experience a custom HashTable gives better performance than it does on 100 million key-value pairs.

2) Pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is restarted the next day.

What we forgot to mention:

1) As our application is part of a project to be deployed on a small campus, we assume no more than 800 million URLs will be accessed. So you can treat the data volume as fixed at roughly 600-700 million.

2) Our main concern is performance.

3) We have to run our application locally.

Edit: the code of our hashmap can be found here.

Arpssss asked Mar 03 '26


1 Answer

It might be best to access the table as a memory-mapped buffer. That way, you could simply implement random access to the file, without worrying about loading and storing, and leave caching to the operating system. I see that your current implementation already uses memory-mapped access for reading and writing, but it still loads things into the Java heap in between. Avoid this data duplication and copying! Treat the backing file itself as the data structure, and access only the portions of it that you actually need, only when you need them.
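Reading and writing in place, as suggested above, might look like the following sketch. The file name, slot count, and slot layout (one long per slot) are assumptions for illustration; the point is that each access touches the mapping directly, with no intermediate heap array.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: the mapped file *is* the table. Slot i of a long-valued table
// lives at byte offset i * 8; the OS page cache decides what stays in RAM,
// so there is no separate "load" or "store" phase at all.
public class MappedTable {
    static long putAndGet(Path file, long slots, int slot, long value)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            LongBuffer table = ch.map(FileChannel.MapMode.READ_WRITE, 0, slots * 8)
                                 .order(ByteOrder.nativeOrder())
                                 .asLongBuffer();
            table.put(slot, value); // write lands in the OS page cache
            return table.get(slot); // read comes straight from the mapping
        }
    }
}
```

Startup then costs only the `map` call; pages are faulted in lazily as slots are touched, rather than in a 16-19 second bulk load.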

Within that file, hash maps will work if you are really, really sure that hash collisions are not an issue. Otherwise I'd go for a B+ tree, with nodes about the size of your hard disk pages. That way, each disk access yields a lot more usable data than just a single key, resulting in a shallower tree and fewer individual disk operations.
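Rough arithmetic shows why page-sized nodes keep the tree shallow. Assuming 4 KiB pages and entries of an 8-byte key plus an 8-byte child offset (both assumptions for illustration):

```java
// Back-of-the-envelope fan-out and depth for a B+ tree with
// fixed-size entries packed into disk-page-sized nodes.
public class BTreeDepth {
    // How many children fit in one node.
    static int fanout(int pageSize, int entrySize) {
        return pageSize / entrySize;
    }

    // Levels needed to index `keys` entries at the given fan-out.
    static int depth(long keys, int fanout) {
        return (int) Math.ceil(Math.log((double) keys) / Math.log(fanout));
    }
}
```

With a 4096-byte page and 16-byte entries, the fan-out is 256, so 600 million keys fit in a tree only about 4 levels deep - and with the upper levels cached, a lookup costs only a disk access or two.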

I imagine others have implemented things like this, but since you preferred your own hash map implementation, you might prefer to write your own memory-mapped B+ tree as well.

MvG answered Mar 06 '26