
Setup and configuration of JanusGraph for a Spark cluster and Cassandra

I am running JanusGraph (0.1.0) with Spark (1.6.1) on a single machine. I configured it as described here. When I access the graph from the Gremlin Console with the SparkGraphComputer, it is always empty. I cannot find any errors in the log files; the graph is simply empty.

Is anyone using JanusGraph with Spark who can share their configuration and properties?

Using a JanusGraph, I get the expected output:

gremlin> graph=JanusGraphFactory.open('conf/test.properties')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().count()
14:26:10 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>1000001
gremlin>

Using a HadoopGraph with Spark as GraphComputer, the graph is empty:

gremlin> graph=GraphFactory.open('conf/test.properties')
==>hadoopgraph[cassandrainputformat->gryooutputformat]
gremlin> g=graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>0

My conf/test.properties:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# Titan Cassandra InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
storage.keyspace=janusgraph

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=janusgraph
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647

#
# SparkGraphComputer Configuration
#
spark.master=spark://127.0.0.1:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=100g

gremlin.spark.persistContext=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

HDFS seems to be configured correctly as described here:

gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_178390072_1, ugi=cassandra (auth:SIMPLE)]]]

Felix Hill asked May 05 '17 12:05


1 Answer

Try fixing these properties:

janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.keyspace=janusgraph

Replace with:

janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
storage.cassandra.keyspace=janusgraph

The default keyspace name is janusgraph, so despite the mistakes in the property names, I don't think you would have seen this problem unless you loaded your data into a different keyspace.

The latter property is described in the Configuration Reference. Also, keep an eye on this open issue to improve the docs for Hadoop-Graph usage.
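
As a rough sketch of the rename suggested above, a small Python script could rewrite the two keys in the text of a properties file. The helper name `fix_keyspace_keys` and the `RENAMES` table are made up for illustration and are not part of JanusGraph or TinkerPop; only the property names themselves come from this answer.

```python
# Hypothetical helper, not part of JanusGraph: renames the two misnamed
# keyspace keys from the question to the names suggested in the answer.
RENAMES = {
    "janusgraphmr.ioformat.conf.storage.keyspace":
        "janusgraphmr.ioformat.conf.storage.cassandra.keyspace",
    "storage.keyspace": "storage.cassandra.keyspace",
}


def fix_keyspace_keys(text):
    """Return (fixed_text, changed_keys) for the contents of a .properties file."""
    fixed_lines = []
    changed = []
    for line in text.splitlines():
        stripped = line.strip()
        # Leave blank lines, comments, and lines without '=' untouched.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            fixed_lines.append(line)
            continue
        key, value = stripped.split("=", 1)
        key = key.strip()
        if key in RENAMES:
            changed.append(key)
            fixed_lines.append(f"{RENAMES[key]}={value.strip()}")
        else:
            fixed_lines.append(line)
    return "\n".join(fixed_lines), changed


# Sample input taken from the configuration in the question.
sample = """\
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.keyspace=janusgraph
"""

fixed, changed = fix_keyspace_keys(sample)
print(fixed)
print("renamed:", changed)
```

To try it on a real file, read conf/test.properties into a string, pass it to the function, and review the result before writing it back.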

Jason Plurad answered Oct 16 '22 12:10