
Memory-mapped files: pros and cons?

I need to share data between two Java applications running on the same machine (two different JVMs). The data to be shared is large (about 7 GB). The applications must access the data very quickly because they have to answer incoming queries at a very high rate. I don't want each application to hold its own copy of the data.

I've seen that one option is to use memory-mapped files. Application A gets the data from somewhere (let's say a database) and stores it in files. Application B can then access these files using java.nio. I don't know exactly how memory-mapped files work; I only know that the data is stored in a file and that this file (or a part of it) is mapped to a region of memory (virtual memory?). The two applications can then read and write the data in memory, and the changes are automatically (I guess?) committed to the file. I also don't know whether there is a maximum size for a file to be mapped entirely into memory.

My first question is what are the different possibilities for two applications to share data in this scenario (taking into account that the amount of data is very large and that access to it must be very fast)? Note that this question is not specific to memory-mapped I/O; I just want to know what other ways there are to solve the same problem.

My second question is what are the pros and cons of using memory-mapped files?

Thanks

manash asked Dec 15 '11 21:12



2 Answers

My first question is what are the different possibilities for two applications to share data?

As S.Lott points out, there are a lot of mechanisms:

  • OS-level message queues
  • OS-level POSIX shared memory segments (persist after process death)
  • OS-level memory mappings (could be anonymous or file-backed)
  • OS-level anonymous pipes (unidirectional)
  • OS-level named pipes (unidirectional on POSIX; Windows named pipes can be bidirectional)
  • OS-level sockets (bidirectional) -- whether AF_UNIX or AF_INET or AF_INET6
  • OS-level shared global memory -- suitable for multi-threaded programs
  • Storing data in files
  • Application-level message queues
  • Application-level blackboard-style tuplespaces
  • Application-level key/value stores
  • Application-level remote procedure call frameworks -- many are available
  • Application-level web-based frameworks

My second question is what are the pros and cons of using memory-mapped files?

Pros:

  • very fast -- depending upon how you access the data, potentially zero-copy mechanisms can be used to operate directly on the data with no speed penalties. Care must be taken to update objects in a consistent manner.
  • should be very portable -- mmap has been available on Unix systems for probably 25 years (give or take), Windows has an equivalent in its file-mapping API, and Java wraps both behind FileChannel.map in java.nio.
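To make the mechanism concrete, here is a minimal Java sketch of the java.nio route mentioned in the question. It is an illustration under assumptions, not a definitive implementation: the class name MappedShareSketch, the methods writeLong and readLong, the temporary file, and the 4096-byte size are all made up for the example, and both "applications" run in one JVM here for brevity. Each side maps the same file with FileChannel.map and reads or writes primitives at byte offsets. The operating system writes modified pages back to the file on its own schedule; force() only asks for them to be flushed to storage. The Java documentation leaves it unspecified when a change made through one mapping becomes visible through another, so a real setup still needs its own coordination.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class; the names are illustrative only.
class MappedShareSketch {

    // Stands in for "application A": map the file read-write and store one long.
    // Mapping READ_WRITE past the current end of the file grows the file.
    static void writeLong(Path file, long size, int offset, long value) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.putLong(offset, value); // absolute put at a byte offset
            buf.force();                // request a flush of the change to storage
        }
    }

    // Stands in for "application B": map the same file read-only and read the long back.
    static long readLong(Path file, int offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.getLong(offset);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shared", ".dat");
        file.toFile().deleteOnExit();
        writeLong(file, 4096, 16, 42L);
        System.out.println(readLong(file, 16)); // prints 42
    }
}
```

Neither side copies the data into a heap array first; the get and put calls operate on the mapped region directly, which is where the speed comes from.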

Cons:

  • Single-system sharing. If you want to distribute your application over multiple machines, shared memory isn't a great option. Distributed shared memory systems are available, but they feel very much like the wrong interface to my way of thinking.
  • Even on a single system, if the memory is located on a single NUMA node but needs to be accessed by processors on multiple nodes, the inter-node requests may significantly slow processing compared to giving each node its own segment of the memory.
  • You can't just store pointers -- everything must be stored as offsets to base addresses, because the memory may be mapped at different locations in different processes. I have no idea what this means for Java objects, though presumably someone smart did their best to make it transparent to Java programmers. If you're not using their provided mechanisms, then you probably must do the work yourself. (Without actual pointers in Java, perhaps this is not very onerous.)
  • Updating objects consistently has proven to be very difficult. Passing immutable objects in message-passing systems instead generally results in programs with fewer concurrency bugs. (Concurrent programming in Erlang feels very natural and straightforward. Concurrent programming in more imperative languages tends to introduce a huge pile of new concurrency controls: semaphores, mutexes, spinlocks, monitors.)
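One Java-specific point relevant to the question's size concern: in the classic java.nio API, a single FileChannel.map call cannot map more than Integer.MAX_VALUE bytes (just under 2 GiB), so a 7 GB file has to be covered by several mappings. The sketch below is only an illustration that combines this with the offsets-instead-of-pointers idea. ChunkedMapping is a hypothetical helper, not a library class; it addresses the file by long offsets, and a "link" between two records is stored as a file offset rather than as a reference. The tiny sizes (a 64-byte file split into 32-byte chunks) are assumptions chosen so the example runs instantly, and the second instance stands in for the second JVM. It does no locking, so the consistency problem described above is left untouched.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not a library class: one large file covered by several mappings.
class ChunkedMapping {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    ChunkedMapping(Path file, long totalSize, long chunkSize) throws IOException {
        // Each chunk must fit in one mapping; a multiple of 8 keeps aligned longs inside one chunk.
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE || chunkSize % Long.BYTES != 0) {
            throw new IllegalArgumentException(
                    "chunkSize must be a positive multiple of 8, at most Integer.MAX_VALUE");
        }
        this.chunkSize = chunkSize;
        int count = (int) ((totalSize + chunkSize - 1) / chunkSize);
        this.chunks = new MappedByteBuffer[count];
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            for (int i = 0; i < count; i++) {
                long start = i * chunkSize;
                long length = Math.min(chunkSize, totalSize - start);
                chunks[i] = ch.map(FileChannel.MapMode.READ_WRITE, start, length);
            }
        } // the mappings stay valid after the channel is closed
    }

    // Offsets are assumed to be 8-byte aligned and inside the mapped range.
    long getLong(long offset) {
        return chunks[(int) (offset / chunkSize)].getLong((int) (offset % chunkSize));
    }

    void putLong(long offset, long value) {
        chunks[(int) (offset / chunkSize)].putLong((int) (offset % chunkSize), value);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("chunked", ".dat");
        file.toFile().deleteOnExit();

        ChunkedMapping writer = new ChunkedMapping(file, 64, 32);
        writer.putLong(0, 40L); // the record at offset 0 "points" to offset 40
        writer.putLong(40, 7L); // the target record lives in the second chunk

        ChunkedMapping reader = new ChunkedMapping(file, 64, 32);
        long next = reader.getLong(0);            // follow the stored offset
        System.out.println(reader.getLong(next)); // prints 7
    }
}
```

Because the link is a plain offset, it means the same thing in every process regardless of where the operating system places the mapping in that process's address space.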
sarnold answered Nov 07 '22 12:11



Memory-mapped files sound like a headache. A simpler and less error-prone option would be to use a shared database with a cluster-aware cache. That way only writes go down to the database, and reads can be served from the cache.

As an example of how to do this in Hibernate, see http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html#performance-cache

Lionel Port answered Nov 07 '22 11:11
