
'Big dictionary' implementation in Java

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file on the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will later be used to look strings up.
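A minimal sketch of that first solution, assuming one tab-separated `word<TAB>id` pair per line (the actual file format isn't specified here):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Builds the whole dictionary on the heap at startup.
static Map<String, Integer> loadDictionary(String path) throws IOException {
    Map<String, Integer> dict = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");      // assumed format: word<TAB>id
            dict.put(parts[0], Integer.parseInt(parts[1]));
        }
    }
    return dict;
}
```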

Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say on the order of the available RAM? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.

Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in the form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary and obtain their keys as fast as possible. The initial overhead of 'reading' the dictionary file or indexing it into a database is not as important, as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of the RAM available.
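For context, a hedged sketch of the sparse-vector use case (whitespace tokenization and the term-count representation are assumptions about the asker's setup):

```java
import java.util.HashMap;
import java.util.Map;

// Turns a document into a sparse term-count vector keyed by the
// dictionary's integer ids; dict.get() is the hot look-up path.
static Map<Integer, Integer> toSparseVector(String text, Map<String, Integer> dict) {
    Map<Integer, Integer> vector = new HashMap<>();
    for (String token : text.split("\\s+")) {
        Integer id = dict.get(token);          // the critical fast lookup
        if (id != null) {
            vector.merge(id, 1, Integer::sum); // increment the count for this id
        }
    }
    return vector;
}
```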

asked Sep 29 '14 by tsotsi

2 Answers

Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a super-lightweight NoSQL key-value store.

What it does for your task out of the box:

  • Persistence to disk via memory-mapped files (see the comment by Michał Kosmulski)
  • Lazy loading (disk pages are loaded only on demand) -> fast startup
  • If your data volume is larger than available memory, the operating system will unmap rarely used pages automatically.
  • Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
  • Strings are stored in UTF-8 form -> ~50% memory savings if the strings are mostly ASCII (as maaartinus noted)
  • int or long values take just 4 (or 8) bytes, as if you had a primitive-specialized map implementation.
  • Very little per-entry memory overhead, much less than in the standard HashMap and ConcurrentHashMap
  • Good configurable concurrency via lock striping, if you already need it, or are going to parallelize the text processing in the future.
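A minimal sketch of how creating such a map might look with ChronicleMap's builder API (exact method names vary between library versions; the entry count, average key, and file name here are assumptions):

```java
import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;
import java.io.IOException;

public class DictDemo {
    public static void main(String[] args) throws IOException {
        // Off-heap map persisted to a memory-mapped file; pages load on demand.
        try (ChronicleMap<CharSequence, Integer> dict = ChronicleMap
                .of(CharSequence.class, Integer.class)
                .averageKey("representative-word")  // helps size the off-heap segments
                .entries(10_000_000)                // assumed upper bound on dictionary size
                .createPersistedTo(new File("dictionary.dat"))) {
            dict.put("example", 42);
            System.out.println(dict.get("example")); // lookup hits mapped memory, not the heap
        }
    }
}
```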
answered by leventov

Once your data structure grows from a few hundred MB toward the size of your RAM, you're better off not initializing a data structure at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets so large that you're running up against the -Xmx settings of your JVM. This is because if your file is as large as, or much larger than, your maximum heap size setting, you're inevitably going to crash your JVM.
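As a sketch of the indexed-lookup idea, assuming an embedded H2 database on the classpath (the table layout and JDBC URL are illustrative; any store with an index on the word column behaves similarly):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DictionaryDb {
    public static void main(String[] args) throws SQLException {
        // Embedded H2 database persisted next to the application (assumed setup).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./dictionary")) {
            try (Statement st = conn.createStatement()) {
                // PRIMARY KEY on word creates an index, so lookups avoid full scans.
                st.execute("CREATE TABLE IF NOT EXISTS dict (word VARCHAR PRIMARY KEY, id INT)");
            }
            try (PreparedStatement lookup =
                     conn.prepareStatement("SELECT id FROM dict WHERE word = ?")) {
                lookup.setString(1, "example");
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("id = " + rs.getInt(1));
                    }
                }
            }
        }
    }
}
```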

As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can parallelize this process if there are other parts of your code execution that don't depend on it.
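A sketch of that one-time bulk load, reusing the H2 table above and again assuming a tab-separated `word<TAB>id` file; batching keeps the per-row JDBC overhead low:

```java
// One-time bulk load into the indexed table (same imports as the lookup example,
// plus java.io.BufferedReader / java.io.FileReader).
try (Connection conn = DriverManager.getConnection("jdbc:h2:./dictionary");
     BufferedReader in = new BufferedReader(new FileReader("dictionary.txt"));
     PreparedStatement ins = conn.prepareStatement("INSERT INTO dict VALUES (?, ?)")) {
    conn.setAutoCommit(false);                 // commit in batches, not per row
    String line;
    int pending = 0;
    while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        ins.setString(1, parts[0]);
        ins.setInt(2, Integer.parseInt(parts[1]));
        ins.addBatch();
        if (++pending % 10_000 == 0) {         // flush every 10k rows
            ins.executeBatch();
            conn.commit();
        }
    }
    ins.executeBatch();                        // flush the remainder
    conn.commit();
}
```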

Please let me know if you have any questions!

answered by Devarsh Desai