I have been experimenting with using graphs to analyze big data. It's been working great and has been really fun, but I'm wondering what to do as the data gets bigger and bigger.
Let me know if there's a better solution, but I thought of trying HBase because it scales horizontally and I can use Hadoop to run analytics on the graph (most of my code is already written in Java). However, I'm unsure how to structure a graph in a NoSQL database. I know each node can be an entry in the database, but I'm not sure how to model edges or how to attach properties to nodes and edges (node names, attributes, PageRank scores, edge weights, etc.).
Since HBase and Hadoop are modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I'm not sure how. Any suggestions?
Also, does what I'm trying to do make sense, or are there better solutions for big-data graphs?
HGraphDB is a client layer for using HBase as a graph database. It is an implementation of the Apache TinkerPop 3 interfaces.
Most graph database systems store data in a structure similar to linked lists. They store direct links to data which is connected, rather than similar objects.
MongoDB as a Graph Database. MongoDB offers graphing capabilities with its $graphLookup stage. Graph databases fulfill a need that traditional databases have left unmet: They prioritize relationships between entities.
There are two popular models of graph databases: property graphs and RDF graphs. The property graph focuses on analytics and querying, while the RDF graph emphasizes data integration. Both types of graphs consist of a collection of points (vertices) and the connections between those points (edges).
You can store an adjacency list in HBase/Accumulo in a column-oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different), so you might use a schema similar to:
SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)
Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
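To make the layout above concrete, here is a minimal in-memory sketch in Java. It does not use the HBase or Accumulo client APIs; a nested TreeMap stands in for the sorted row/column store, and the class and method names (AdjacencyListSketch, putEdge, neighbors) are made up for illustration. It also assumes edge type names never contain a ':' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for a sorted column-family store.
// This is NOT the HBase or Accumulo API; all names here are hypothetical.
final class AdjacencyListSketch {
    // row key (source node) -> "edgeType:destNode" (CF:CFQ) -> value (edge properties)
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Writes one edge as a single cell in the source node's row.
    void putEdge(String src, String edgeType, String dest, String properties) {
        rows.computeIfAbsent(src, k -> new TreeMap<>())
            .put(edgeType + ":" + dest, properties);
    }

    // Reads one row and returns the destinations stored under one edge type.
    List<String> neighbors(String src, String edgeType) {
        List<String> result = new ArrayList<>();
        TreeMap<String, String> row = rows.get(src);
        if (row == null) {
            return result;
        }
        String prefix = edgeType + ":";
        // Columns are sorted, so all cells of one family are contiguous.
        for (String column : row.tailMap(prefix, true).keySet()) {
            if (!column.startsWith(prefix)) {
                break;
            }
            result.add(column.substring(prefix.length()));
        }
        return result;
    }

    // Returns the value stored for one edge, or null if the edge is absent.
    String edgeProperties(String src, String edgeType, String dest) {
        TreeMap<String, String> row = rows.get(src);
        return row == null ? null : row.get(edgeType + ":" + dest);
    }

    public static void main(String[] args) {
        AdjacencyListSketch graph = new AdjacencyListSketch();
        graph.putEdge("a", "follows", "b", "weight=1.0");
        graph.putEdge("a", "follows", "c", "weight=0.5");
        graph.putEdge("a", "likes", "d", "weight=2.0");
        System.out.println(graph.neighbors("a", "follows"));
    }
}
```

With this layout, fetching all outgoing edges of a node is a single-row read, which is the main appeal of keying rows by source node. Finding incoming edges would need a full scan or a second, reversed copy of each edge.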
You might also store node/vertex properties as separate rows using something like:
Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)
The PropertyValue could be stored either in the CFQ or in the Value.
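The two placements of the property value can be compared with a small in-memory sketch in Java. As before, a nested TreeMap stands in for the store and nothing here is HBase or Accumulo API; the class and method names are hypothetical, and property type names are assumed not to contain ':'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for node-property rows; NOT the HBase or Accumulo API.
final class NodePropertySketch {
    // row key (node) -> "propertyType:qualifier" -> value
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Variant 1: empty qualifier, property value kept in the cell value.
    // A second write for the same property type overwrites the first.
    void putInValue(String node, String propertyType, String propertyValue) {
        rows.computeIfAbsent(node, k -> new TreeMap<>())
            .put(propertyType + ":", propertyValue);
    }

    // Variant 2: property value kept in the qualifier, cell value left empty.
    // Several values can coexist under one property type.
    void putInQualifier(String node, String propertyType, String propertyValue) {
        rows.computeIfAbsent(node, k -> new TreeMap<>())
            .put(propertyType + ":" + propertyValue, "");
    }

    // Reads a variant-1 property, or null if it is absent.
    String getFromValue(String node, String propertyType) {
        TreeMap<String, String> row = rows.get(node);
        return row == null ? null : row.get(propertyType + ":");
    }

    // Reads all variant-2 values of a property type, in sorted order.
    List<String> getFromQualifiers(String node, String propertyType) {
        List<String> result = new ArrayList<>();
        TreeMap<String, String> row = rows.get(node);
        if (row == null) {
            return result;
        }
        String prefix = propertyType + ":";
        for (String column : row.tailMap(prefix, true).keySet()) {
            if (!column.startsWith(prefix)) {
                break;
            }
            result.add(column.substring(prefix.length()));
        }
        return result;
    }

    public static void main(String[] args) {
        NodePropertySketch nodes = new NodePropertySketch();
        nodes.putInValue("a", "pagerank", "0.25");
        nodes.putInQualifier("a", "tag", "hadoop");
        nodes.putInQualifier("a", "tag", "graphs");
        System.out.println(nodes.getFromValue("a", "pagerank"));
        System.out.println(nodes.getFromQualifiers("a", "tag"));
    }
}
```

In this sketch, keeping the value in the cell value suits single-valued properties that get overwritten (a PageRank score, for example), while keeping it in the qualifier suits multi-valued properties, since each distinct value becomes its own column.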
From a graph-processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google's Pregel. Pregel is the system Google uses for large-scale graph processing.
Using HBase/Accumulo as input to Giraph was recently submitted (7 Mar 2012) as a new feature request to Giraph: HBase/Accumulo Input and Output formats (GIRAPH-153).
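For a feel of the Pregel style of computation, here is a rough single-machine sketch of PageRank written as supersteps in plain Java. It does not use Giraph's API; the class name and the run method are hypothetical. It assumes every vertex appears as a key in the adjacency map, and it lets vertices with no outgoing edges send nothing, which is a simplification.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Vertex-centric PageRank in the spirit of Pregel; NOT the Giraph API.
final class PregelStylePageRank {
    // outEdges: vertex -> list of destination vertices (every vertex must be a key).
    static Map<String, Double> run(Map<String, List<String>> outEdges,
                                   int supersteps, double damping) {
        int n = outEdges.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : outEdges.keySet()) {
            rank.put(v, 1.0 / n);
        }
        for (int step = 0; step < supersteps; step++) {
            // Message phase: each vertex sends an equal share of its rank
            // along every outgoing edge.
            Map<String, Double> inbox = new HashMap<>();
            for (Map.Entry<String, List<String>> e : outEdges.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // dangling vertex sends nothing in this sketch
                }
                double share = rank.get(e.getKey()) / targets.size();
                for (String t : targets) {
                    inbox.merge(t, share, Double::sum);
                }
            }
            // Compute phase: each vertex combines its incoming messages.
            Map<String, Double> next = new HashMap<>();
            for (String v : outEdges.keySet()) {
                next.put(v, (1 - damping) / n + damping * inbox.getOrDefault(v, 0.0));
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c"),
            "c", List.of("a"));
        System.out.println(run(graph, 20, 0.85));
    }
}
```

In a real Pregel-style system the two phases run in parallel across machines, with the adjacency lists partitioned by vertex, which is why a row-per-source-node layout in HBase or Accumulo maps naturally onto it.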