
How to load Spark Cassandra Connector in the shell?

I am trying to use the Spark Cassandra Connector with Spark 1.1.0.

I have successfully built the jar file from the master branch on GitHub and have gotten the included demos to work. However, when I try to load the jar file into spark-shell, I can't import any of the classes from the com.datastax.spark.connector package.

I have tried using the --jars option on spark-shell and adding the directory with the jar file to Java's CLASSPATH. Neither of these options works. In fact, when I use the --jars option, the logging output shows that the DataStax jar is being loaded, but I still cannot import anything from com.datastax.

I have been able to load the Tuplejump Calliope Cassandra connector into spark-shell using --jars, so I know that works. It's just the DataStax connector that is failing for me.

asked Sep 14 '14 by egerhard


People also ask

How do I connect Spark to Cassandra?

To connect Spark to a Cassandra cluster, the Cassandra Connector needs to be added to the Spark project. DataStax provides their own Cassandra Connector on GitHub, and we will use that. Building it (for example with sbt assembly) should output compiled jar files to the directory named "target". There will be two jar files, one for Scala and one for Java.

How does Spark work with Cassandra?

The fundamental idea is quite simple: Spark and Cassandra clusters are deployed to the same set of machines. Cassandra stores the data; Spark worker nodes are co-located with Cassandra and do the data processing. Spark is a batch-processing system, designed to deal with large amounts of data.

Can Cassandra and Spark run on the same cluster?

Yes. This is a setup similar to that used in Cassandra database clusters, so these types of clusters can run Spark and Cassandra on the same machines, using Cassandra instead of HDFS for storage.

What is Cassandra connector?

The Spark Cassandra Connector Java API allows you to create Java applications that use Spark to analyze database data. See the Spark Cassandra Connector Java Doc on GitHub.



2 Answers

I got it working. Below is what I did:

  $ git clone https://github.com/datastax/spark-cassandra-connector.git
  $ cd spark-cassandra-connector
  $ sbt/sbt assembly
  $ $SPARK_HOME/bin/spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/connector-assembly-1.2.0-SNAPSHOT.jar

At the Scala prompt:

  scala> sc.stop
  scala> import com.datastax.spark.connector._
  scala> import org.apache.spark.SparkContext
  scala> import org.apache.spark.SparkContext._
  scala> import org.apache.spark.SparkConf
  scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "my cassandra host")
  scala> val sc = new SparkContext("spark://spark host:7077", "test", conf)
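
With the new context in place, the connector's methods are available on sc. A minimal check, assuming a keyspace and table that already exist on your cluster (the names below are placeholders):

  scala> val rdd = sc.cassandraTable("my_keyspace", "my_table")
  scala> rdd.count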
answered Sep 30 '22 by Lishu


Edit: Things are a bit easier now

For in-depth instructions, check out the project documentation: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/13_spark_shell.md

Or feel free to use Spark Packages to load the library (not all versions are published there): http://spark-packages.org/package/datastax/spark-cassandra-connector

> $SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3-s_2.10 
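
When the shell is launched this way, the connector classes are already on the classpath, so there is no need to stop and rebuild the context if the Cassandra host is also supplied at launch. A minimal sketch, assuming a Cassandra node reachable at 127.0.0.1 (the host, keyspace, and table values are placeholders):

  > $SPARK_HOME/bin/spark-shell \
      --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3-s_2.10 \
      --conf spark.cassandra.connection.host=127.0.0.1

  scala> import com.datastax.spark.connector._
  scala> sc.cassandraTable("my_keyspace", "my_table").count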

The following assumes you are running with OSS Apache Cassandra (C*).

You'll want to start the shell with --driver-class-path set to include all your connector libs.

I'll quote a blog post from the illustrious Amy Tobey

The easiest way I’ve found is to set the classpath, then restart the context in the REPL with the necessary classes imported to make sc.cassandraTable() visible. The newly loaded methods will not show up in tab completion. I don’t know why.

  /opt/spark/bin/spark-shell --driver-class-path $(echo /path/to/connector/*.jar |sed 's/ /:/g') 

It will print a bunch of log information, then present the scala> prompt.

scala> sc.stop 

Now that the context is stopped, it’s time to import the connector.

  scala> import com.datastax.spark.connector._
  scala> import org.apache.spark.{SparkConf, SparkContext}
  scala> val conf = new SparkConf()
  scala> conf.set("cassandra.connection.host", "node1.pc.datastax.com")
  scala> val sc = new SparkContext("local[2]", "Cassandra Connector Test", conf)
  scala> val table = sc.cassandraTable("keyspace", "table")
  scala> table.count
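
Writing back to Cassandra works the same way once the connector is imported. A minimal sketch, assuming a table test.words with columns word and count already exists (the keyspace, table, and column names here are placeholders, not from the answer above):

  scala> val words = sc.parallelize(Seq(("foo", 10), ("bar", 20)))
  scala> words.saveToCassandra("test", "words", SomeColumns("word", "count"))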

If you are running with DSE < 4.5.1

There is a slight issue with the DSE classloader and previous package naming conventions that will prevent you from finding the new spark-connector libraries. You should be able to get around this by removing the line specifying the DSE classloader in the scripts that start spark-shell.

answered Sep 30 '22 by RussS