Just started my excursion to graph processing methods and tools. What we basically do - count some standard metrics like pagerank, clustering coefficient, triangle count, diameter, connectivity etc. In the past was happy with Octave, but when we started to work with graphs having let's say 10^9 nodes/edges we stuck.
So the possible solutions can be distributed cloud made with Hadoop/Giraph, Spark/GraphX, Neo4j on top of them, etc.
But since I am a beginner, can someone advise what actually to choose? I did not get the difference when to use Spark/GraphX and when Neo4j? Right now I consider Spark/GraphX, since it have more Python alike syntax, while neo4j has the own Cypher. Visualization in neo4j is cool but not useful in such a large scale. I do not understand is there a reason to use additional level of software (neo4j) or just use Spark/GraphX? Since I understood neo4j will not save so much time like if we worked with pure hadoop vs Giraph or GraphX or Hive.
Thank you.
GraphX is the Spark API for graphs and graph-parallel computation. It includes a growing collection of graph algorithms and builders to simplify graph analytics tasks. GraphX extends the Spark RDD with a Resilient Distributed Property Graph.
Since the Neo4j graph database is not a triple store, it is not equipped with a SPARQL query engine.
GraphX allows graphs to be read from Hive using an SQL-like query, allowing arbitrary column transformations. Giraph requires extra programming effort for such preprocessing tasks. Interacting with data in Spark (e.g., checking the output of a job) is convenient, though mostly when experimenting with small graphs.
Neo4J: It is a graphical database which helps out identifying the relationships and entities data usually from the disk. It's popularity and choice is given in this link. But when it needs to process the very large data-sets and real time processing to produce the graphical results/representation it needs to scale horizontally. In this case combination of Neo4J with Apache Spark will give significant performance benefits in such a way Spark will serve as an external graph compute solution.
Mazerunner is a distributed graph processing platform which extends Neo4J. It uses message broker to process distribute graph processing jobs to Apache Spark GraphX module.
GraphX: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. It supports multiple Graph algorithms.
Conclusion: It is always recommended to use the Hybrid combination of Neo4j with GraphX as they both easier to integrate.
For real time processing and processing large data-sets, use neo4j with GraphX.
For simple persistence and to show the entity relationship for a simple graphical display representation use standalone neo4j.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With