 

How to enable streaming from Cassandra to Spark?

I have the following Spark job:

from __future__ import print_function

import os
import sys
import time
from random import random
from operator import add

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":

    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)  # 2-second batch interval

    rdd = sc.cassandraTable("keyspace2", "users").collect()
    # print(rdd)
    stream.start()
    stream.awaitTermination()
    sc.stop()

When I run this, it gives me the following error:

ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: \
No output operations registered, so nothing to execute

The spark-submit command I run:

./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py

Comparing this with Spark Streaming over Kafka, the equivalent of this line is missing from my code:

kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})

There, createStream actually wires the source into the StreamingContext, but I can't see anything like it for Cassandra in the docs. How do I start the streaming between Spark Streaming and Cassandra?

Versions:

Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
asked Jan 26 '16 by HackCode


1 Answer

To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing it the RDD created from the Cassandra table as input. This results in that RDD being materialized on each DStream batch interval.
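
Note that ConstantInputDStream only exists in the Scala/Java API; PySpark does not expose it. A close Python equivalent is queueStream with a default RDD, which re-emits that RDD on every batch interval. Here is a minimal sketch under that assumption (and assuming pyspark-cassandra's cassandraTable returns a standard PySpark RDD); it also registers an output operation, which is what your current code is missing and why the context refuses to start:

from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().setAppName("PySpark Cassandra Test")
sc = CassandraSparkContext(conf=conf)
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# Do NOT collect() here; keep it as an RDD so the stream can materialize it.
users_rdd = sc.cassandraTable("keyspace2", "users")

# The queued RDD is consumed on the first interval; afterwards the default
# RDD is re-emitted every interval, mimicking Scala's ConstantInputDStream.
user_stream = ssc.queueStream([users_rdd], default=users_rdd)

# At least one output operation is required, otherwise the context fails
# with "No output operations registered, so nothing to execute".
user_stream.pprint()

ssc.start()
ssc.awaitTermination()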

Be warned that large tables, or tables that continuously grow in size, will negatively impact the performance of your streaming job.
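
If you only need a slice of the table on each interval, one mitigation is to push a projection and filter down to Cassandra before wiring the RDD into the stream, using the select() and where() calls documented in the pyspark-cassandra README. The column names and clause below are hypothetical:

# Hypothetical columns; fetch only what is needed and filter server-side.
users_rdd = sc.cassandraTable("keyspace2", "users") \
    .select("user_id", "last_seen") \
    .where("last_seen > ?", "2016-01-01")
user_stream = ssc.queueStream([users_rdd], default=users_rdd)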

See also: Reading from Cassandra using Spark Streaming for an example.

answered Sep 23 '22 by maasg