I've read several topics, but I'm still lost; I'm quite new to this. I want to store a huge sparse matrix and have several ideas, but I can't choose between them. Here are my needs:
So, here are my ideas:
Please help me choose, or suggest a better solution.
If I'm wrong with my estimates somewhere, please correct me.
A hybrid neo4j/hbase approach may work well here: neo4j handles the graph-processing side, while hbase does the heavy lifting scalability-wise, e.g. storing lots of extra attributes.
neo4j holds the nodes and relationships, and it may well be enough on its own: independent (non-neo4j) sources on the web claim up to several billion nodes/relationships on a single machine, with traversal performance a couple of orders of magnitude better than an RDBMS.
But if more scalability were needed, you could bring in hbase as the big iron to store the extra attributes beyond the node/relationship identifiers. Then simply store the hbase rowkey in the neo4j node's properties, so the application can look up the row when needed.
In the end, I implemented solution number one.
I used PostgreSQL with two tables: one for edges, with two columns (start/end vertex ids), and one for vertices, with a unique serial as the vertex number plus some columns for the vertex description.
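A minimal sketch of that layout (the column names are illustrative; the original post doesn't give them):

    -- Vertices: a unique serial id plus description columns.
    CREATE TABLE vertices (
        id   serial PRIMARY KEY,
        name text UNIQUE NOT NULL  -- stand-in for the vertex description columns
    );

    -- Edges: just the two endpoints; the pair itself is the key.
    CREATE TABLE edges (
        start_id integer NOT NULL REFERENCES vertices(id),
        end_id   integer NOT NULL REFERENCES vertices(id),
        PRIMARY KEY (start_id, end_id)
    );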
I implemented an upsert based on pg_advisory_xact_lock. It was a bit slow, but it was enough for my purposes.
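A sketch of how such a locked upsert can look, assuming the schema above; the function name and the use of hashtext() to derive the lock key are my illustration, not necessarily what the author wrote:

    CREATE OR REPLACE FUNCTION upsert_vertex(p_name text)
    RETURNS integer AS $$
    DECLARE
        v_id integer;
    BEGIN
        -- Serialize concurrent upserts of the same key until the
        -- transaction ends; hashtext() folds the key into a lock id.
        PERFORM pg_advisory_xact_lock(hashtext(p_name));

        SELECT id INTO v_id FROM vertices WHERE name = p_name;
        IF v_id IS NULL THEN
            INSERT INTO vertices (name) VALUES (p_name) RETURNING id INTO v_id;
        END IF;
        RETURN v_id;
    END;
    $$ LANGUAGE plpgsql;

    -- Usage: SELECT upsert_vertex('some vertex');

On PostgreSQL 9.5+ the same thing can be done with INSERT ... ON CONFLICT, but the advisory-lock pattern was the usual workaround before that.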
Also, deleting a vertex from this configuration is a pain.
To speed up multiplication, I exported the edges table to a file. On an x64 machine it can even be kept entirely in RAM.
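The export itself can be a one-liner; the path and format here are illustrative:

    -- Server-side export (needs superuser or pg_write_server_files);
    -- from psql, the client-side equivalent is: \copy edges TO 'edges.csv' CSV
    COPY edges TO '/tmp/edges.csv' WITH (FORMAT csv);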
To be fair, there was less data than I expected: instead of 50 million vertices with 200-300 edges per vertex on average, there were only 7 million vertices and 160 million edges total. At two 4-byte vertex ids per edge, that is only about 1.3 GB, which fits in RAM easily.