Hazelcast and MapDB - implementing a simple distributed database

I have implemented a Hazelcast service which stores its data into local MapDB instances via MapStoreFactory and newMapLoader. This way the keys can be loaded again if a cluster restart is necessary:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.mapdb.DB;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.hazelcast.core.MapStore;

public class HCMapStore<V> implements MapStore<String, V> {

    private static final Logger logger = LoggerFactory.getLogger(HCMapStore.class);

    private final DB db;
    private final Map<String, V> map;

    /** specify the mapdb e.g. via
     *  DBMaker.newFileDB(new File("mapdb")).closeOnJvmShutdown().make()
     */
    public HCMapStore(DB db) {
        this.db = db;
        this.map = db.createHashMap("someMapName").<String, V>makeOrGet();
    }

    // some other store methods are omitted

    @Override
    public void delete(String k) {
        logger.info("delete, " + k);
        map.remove(k);
        db.commit();
    }

    // MapLoader methods

    @Override
    public V load(String key) {
        logger.info("load, " + key);
        return map.get(key);
    }

    @Override
    public Set<String> loadAllKeys() {
        logger.info("loadAllKeys");
        return map.keySet();
    }

    @Override
    public Map<String, V> loadAll(Collection<String> keys) {
        logger.info("loadAll, " + keys);
        Map<String, V> partialMap = new HashMap<>();
        for (String k : keys) {
            partialMap.put(k, map.get(k));
        }
        return partialMap;
    }
}

The problem I'm now facing is that the loadAllKeys method of the MapLoader interface from Hazelcast requires returning ALL keys of the whole cluster, BUT every node stores ONLY the objects it owns.

Example: I have two nodes and store 8 objects; then e.g. 5 objects are stored in the MapDB of node1 and 3 in the MapDB of node2. Which object is owned by which node is decided by Hazelcast, I think. Now on restart node1 will return 5 keys for loadAllKeys and node2 will return 3. Hazelcast decides to ignore the 3 items and the data is 'lost'.
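To see the ownership for yourself, here is a small self-contained snippet (not part of my service; it assumes a running Hazelcast 3.x instance and a class name of my own choosing) that prints which member owns each key via the PartitionService:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.Partition;

public class PartitionOwnershipDemo {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> map = hz.getMap("someMapName");
        for (int i = 0; i < 8; i++) {
            String key = "key" + i;
            map.put(key, "value" + i);
            // the PartitionService reveals which member owns the partition a key hashes to
            Partition partition = hz.getPartitionService().getPartition(key);
            System.out.println(key + " -> partition " + partition.getPartitionId()
                    + ", owner " + partition.getOwner());
        }
        hz.shutdown();
    }
}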

What could be a good solution to this?

Update for bounty: I asked this on the Hazelcast mailing list, mentioning 2 options (I'll add 1 more here), and I would like to know if something like this is already possible with Hazelcast 3.2 or 3.3:

  1. Currently the MapStore interface gets only data or updates from the local node. Would it be possible to notify the MapStore interface of every storage action in the full cluster? Or maybe this is already possible with some listener magic (see the sketch after this list)? Maybe I can force Hazelcast to put all objects into one partition and keep one copy on every node.

  2. If I restart e.g. 2 nodes, then the MapStore interface gets called correctly with my local databases for node1 and then for node2. But when both nodes join, the data of node2 will be removed, as Hazelcast assumes that only the master node can be correct. Could I teach Hazelcast to accept the data from both nodes?
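For option 1, the 'listener magic' could look roughly like the following sketch: every node registers a cluster-wide EntryListener and mirrors all events into its local MapDB, so each local store would end up holding a full copy. This is untested, and the map/listener names are mine:

import java.util.Map;

import org.mapdb.DB;

import com.hazelcast.core.EntryEvent;
import com.hazelcast.core.EntryListener;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class MirrorToMapDbListener implements EntryListener<String, Object> {

    private final DB db;
    private final Map<String, Object> local;

    public MirrorToMapDbListener(DB db) {
        this.db = db;
        this.local = db.createHashMap("mirror").<String, Object>makeOrGet();
    }

    @Override public void entryAdded(EntryEvent<String, Object> e) { persist(e); }
    @Override public void entryUpdated(EntryEvent<String, Object> e) { persist(e); }

    @Override public void entryRemoved(EntryEvent<String, Object> e) {
        local.remove(e.getKey());
        db.commit();
    }

    @Override public void entryEvicted(EntryEvent<String, Object> e) {
        // eviction is a cache event, not a delete; keep the persisted copy
    }

    private void persist(EntryEvent<String, Object> e) {
        local.put(e.getKey(), e.getValue());
        db.commit();
    }

    public static void register(HazelcastInstance hz, DB db) {
        IMap<String, Object> map = hz.getMap("someMapName");
        // 'true' means events include the value; events are delivered for
        // entries owned by any member, not just the local one
        map.addEntryListener(new MirrorToMapDbListener(db), true);
    }
}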

asked by Karussell
1 Answer

According to the Hazelcast 3.3 documentation, the MapLoader initialization flow is the following:

When getMap() is first called from any node, initialization will start depending on the value of InitialLoadMode. If it is set to EAGER, initialization starts. If it is set to LAZY, initialization does not start right away; instead, data is loaded each time a partition's loading is completed.

  1. Hazelcast will call MapLoader.loadAllKeys() to get all your keys on each node
  2. Each node will figure out the list of keys it owns
  3. Each node will load all its owned keys by calling MapLoader.loadAll(keys)
  4. Each node puts its owned entries into the map by calling IMap.putTransient(key,value)
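For reference, the initial load mode is configured on the MapStoreConfig. A minimal sketch of a programmatic Hazelcast 3.x setup; the factory class name is a placeholder for your MapStoreFactory implementation:

import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class MapStoreSetup {
    public static HazelcastInstance start() {
        MapStoreConfig storeConfig = new MapStoreConfig()
                .setEnabled(true)
                // EAGER blocks until all partitions are loaded; LAZY loads on first touch
                .setInitialLoadMode(MapStoreConfig.InitialLoadMode.EAGER)
                // placeholder: the factory from the question, creating one HCMapStore per map
                .setFactoryClassName("com.example.HCMapStoreFactory");

        Config config = new Config();
        config.getMapConfig("someMapName").setMapStoreConfig(storeConfig);
        return Hazelcast.newHazelcastInstance(config);
    }
}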

This flow implies that if nodes start up in a different order, then the keys will get distributed differently too. Thus, each node won't find all (or some) of its assigned keys in its local store. You should be able to verify this by setting breakpoints in your HCMapStore.loadAllKeys and HCMapStore.loadAll and comparing the keys that get retrieved with the keys that were originally stored.

In my opinion, what you are trying to achieve contradicts the concept of a distributed cache with resilience characteristics like Hazelcast, and is therefore impossible. That is, when one node goes away (fails or disconnects for whatever reason), the cluster will rebalance by moving parts of the data around, and the same process happens every time a node joins the cluster. So, after cluster changes, the local backstore of the lost node becomes out of date.

A Hazelcast cluster is dynamic by nature, hence it can't rely on a backstore with a static distributed topology. Essentially, you need a shared backstore to make it work with a dynamic Hazelcast cluster. The backstore can be distributed as well, e.g. Cassandra, but its topology must be independent of the cache cluster's topology.
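To make the shared-backstore idea concrete, here is a rough sketch (the SharedStore interface is hypothetical; in practice it would be a client for e.g. a Cassandra table). Because every member queries the same store, loadAllKeys() returns the full key set on each node, so Hazelcast can redistribute ownership freely on restart without losing entries:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.hazelcast.core.MapStore;

// Hypothetical client for a store shared by all members; the point is that
// its contents do not depend on the cache cluster's topology.
interface SharedStore<V> {
    V get(String key);
    void put(String key, V value);
    void remove(String key);
    Set<String> keys();
}

public class SharedBackstoreMapStore<V> implements MapStore<String, V> {

    private final SharedStore<V> store;

    public SharedBackstoreMapStore(SharedStore<V> store) {
        this.store = store;
    }

    @Override public void store(String key, V value) { store.put(key, value); }

    @Override public void storeAll(Map<String, V> entries) {
        for (Map.Entry<String, V> e : entries.entrySet()) {
            store.put(e.getKey(), e.getValue());
        }
    }

    @Override public void delete(String key) { store.remove(key); }

    @Override public void deleteAll(Collection<String> keys) {
        for (String k : keys) store.remove(k);
    }

    @Override public V load(String key) { return store.get(key); }

    @Override public Map<String, V> loadAll(Collection<String> keys) {
        Map<String, V> result = new HashMap<>();
        for (String k : keys) result.put(k, store.get(k));
        return result;
    }

    @Override
    public Set<String> loadAllKeys() {
        // every member sees the same full key set
        return store.keys();
    }
}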

UPDATE: It seems to me that what you are trying to achieve would be more logically built as a distributed datastore (on top of MapDB) with local caching.

I hope this helps.

answered by Vlad