 

How to store sparse adjacency matrix

I've read several topics, but I'm lost. I'm quite new to this. I want to store a huge sparse matrix and have several ideas, but I can't choose between them. Here are my needs:

  1. Adjacency matrix of approx. 50 million vertices.
  2. Maximum number of neighbors per vertex - approx. 10,000.
  3. Average number of neighbors per vertex - approx. 200-300.
  4. Fast row queries - a vector will be multiplied by this matrix.
  5. O(1) complexity to add an edge.
  6. Most probably, edges will not be deleted.
  7. Enumeration of the vertices adjacent to v - as fast as possible.
  8. Portability - there must be a way to transfer the database from one computer to another.

So, here are my ideas:

  1. A huge table with pairs (row, col). Very simple, but enumerating the neighbors of a vertex will take at least O(log N), where N is the size of the table. That seems quite slow to me. Also, it must be indexed. Any RDBMS would be fine for this (a small sketch of this layout follows the list).
  2. An enormous number of lists: one list per vertex. Very fast enumeration, but wouldn't it take a lot of resources to store this? Also, I'm not sure which DBMS to use in this case: maybe some NoSQL?
  3. A huge table (row | set of cols). A combination of the two above. I'm not sure whether any RDBMS supports arbitrary sets. Do you know of any? Maybe NoSQL would be useful here?
  4. A collection of adjacency lists. Any RDBMS would be suitable for that, and the costs in terms of complexity are good, but performance can be killed by multiple requests to the DB for a single vertex.
  5. HDF5 - I think it will be slow due to I/O.
  6. Neo4j - As far as I understand, it stores data in doubly-linked lists, so it would be practically the same as #4, am I right?
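To make idea 1 concrete, here is roughly what I have in mind. SQLite is used only for illustration, and the table, column, and index names are made up:

```python
# Hypothetical sketch of idea 1: one table of (src, dst) pairs, plus an index.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
# The index on src is what keeps neighbor enumeration at roughly O(log N)
# per lookup instead of a full table scan.
con.execute("CREATE INDEX edges_src_idx ON edges (src)")

con.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])
neighbors = [dst for (dst,) in con.execute("SELECT dst FROM edges WHERE src = ?", (1,))]
print(neighbors)  # [2, 3]
```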

Please help me choose, or suggest a better solution.

If my estimates are wrong anywhere, please correct me.

asked Feb 18 '23 by ov7a


2 Answers

A hybrid neo4j/hbase approach may work well, in which neo4j handles the graph-processing aspects while hbase does the heavy lifting scalability-wise - e.g. for storing lots of extra attributes.

neo4j contains the nodes and relationships. It may well be enough scalability-wise. My investigation of independent, non-neo4j sites on the web suggests up to several billion nodes/relationships on a single machine, with a couple of orders of magnitude better traversal performance than an RDBMS.

But... in case more scalability were needed, you can bring in the hbase big iron to store the extra attributes beyond the node/relationship identifiers. Then simply add the hbase row key to the neo4j node info for lookup purposes when needed by the application.
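A rough sketch of how that linkage might look with today's Python drivers; the connection details, table name, and property names are illustrative, not a tested setup:

```python
# Hypothetical sketch: graph structure in neo4j, heavy per-vertex attributes
# in HBase, linked by storing the HBase row key as a node property.
import happybase                     # HBase client (Thrift)
from neo4j import GraphDatabase      # official neo4j Python driver

neo = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
hbase = happybase.Connection("localhost")
attrs = hbase.table("vertex_attrs")  # assumed HBase table name

def add_vertex(vertex_id, heavy_attributes):
    row_key = str(vertex_id).encode()
    # Bulky attributes go to HBase, one row per vertex...
    attrs.put(row_key, {b"a:" + k.encode(): v.encode()
                        for k, v in heavy_attributes.items()})
    # ...while neo4j keeps only the node, its relationships, and the row key
    # needed to look the attributes up later.
    with neo.session() as session:
        session.run(
            "MERGE (v:Vertex {id: $id}) SET v.hbase_key = $key",
            id=vertex_id, key=row_key.decode(),
        )
```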

answered Feb 20 '23 by WestCoastProjects


In the end, I've implemented solution number one.

I used PostgreSQL with two tables: one for edges, with two columns (start/end), and another for vertices, with a unique serial for the vertex number and some columns for the vertex description.
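A minimal sketch of that layout; the exact table and column names here are assumptions, not the ones actually used:

```python
# Two-table layout: vertices with a serial id, edges as (start_id, end_id) pairs.
import psycopg2

conn = psycopg2.connect("dbname=graph")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS vertices (
            id          serial PRIMARY KEY,
            description text
        );
        CREATE TABLE IF NOT EXISTS edges (
            start_id integer NOT NULL,
            end_id   integer NOT NULL,
            PRIMARY KEY (start_id, end_id)  -- the PK index also serves neighbor lookups by start_id
        );
    """)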

I implemented an upsert based on pg_advisory_xact_lock. It was a bit slow, but it was enough for me.
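Roughly, such an upsert can look like this; this is a sketch rather than the exact code used, and taking (start_id, end_id) directly as the advisory-lock key is an assumption:

```python
# pg_advisory_xact_lock(int, int) blocks until the lock is acquired and holds
# it until the transaction ends, so two sessions cannot race to insert the
# same edge.
def add_edge(conn, start_id, end_id):
    with conn, conn.cursor() as cur:
        # Serialize writers for this particular edge.
        cur.execute("SELECT pg_advisory_xact_lock(%s, %s)", (start_id, end_id))
        cur.execute("""
            INSERT INTO edges (start_id, end_id)
            SELECT %s, %s
            WHERE NOT EXISTS (
                SELECT 1 FROM edges WHERE start_id = %s AND end_id = %s
            )
        """, (start_id, end_id, start_id, end_id))
```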

Also, it's a pain to delete a vertex from this configuration.

To speed up the multiplication, I exported the edges table to a file. It can even fit in RAM on an x64 machine.
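A rough sketch of the in-memory multiplication, assuming the edges were exported as a whitespace-separated "start end" text file (the export format and file name are assumptions):

```python
# Build a CSR sparse adjacency matrix from the exported edge list and
# multiply it by a vector; CSR gives fast row access for the product.
import numpy as np
from scipy.sparse import csr_matrix

edges = np.loadtxt("edges.txt", dtype=np.int64)     # shape: (num_edges, 2)
rows, cols = edges[:, 0], edges[:, 1]
n = int(max(rows.max(), cols.max())) + 1

adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

v = np.random.rand(n)
result = adj.dot(v)   # the row-wise matrix-vector product from requirement 4
```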

To be fair, the amount of data was less than I expected. Instead of 50 million vertices with an average of 200-300 edges per vertex, there were only 7 million vertices and 160 million edges in total.

answered Feb 19 '23 by ov7a