I have the following Spark job:
from __future__ import print_function
import os
import sys
import time
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark_cassandra import streaming,CassandraSparkContext
if __name__ == "__main__":
    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)
    rdd = sc.cassandraTable("keyspace2", "users").collect()
    #print rdd
    stream.start()
    stream.awaitTermination()
    sc.stop()
When I run this, it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: \
No output operations registered, so nothing to execute
The shell command I run:
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py
Comparing this with Spark Streaming + Kafka, the code above is missing a line like:
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})
where createStream is what actually attaches the input stream. For Cassandra, I can't see anything like this in the docs. How do I start the streaming between Spark Streaming and Cassandra?
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
To connect Spark to a Cassandra cluster, the Cassandra Connector needs to be added to the Spark project. DataStax provides its own Cassandra Connector on GitHub, and that is what we will use. Building it outputs compiled jar files to a directory named “target”: one jar for Scala and one for Java.
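If you build the job with sbt rather than compiling the connector yourself, a dependency along these lines pulls it in instead. This is a sketch only: the connector version 1.4.1 is an assumption chosen to match Spark 1.4.1 and Scala 2.10, so check the connector's version compatibility table.

// build.sbt (sketch; connector version is assumed)
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-streaming"           % "1.4.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.1"
)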
Spark nodes in such a deployment also tend to use very efficient, low-latency SSDs. This is a similar setup to that used in Cassandra database clusters, so these clusters can run Spark + Cassandra on the same machine types, using Cassandra instead of HDFS for storage.
How does it work? The fundamental idea is quite simple: Spark and Cassandra clusters are deployed to the same set of machines. Cassandra stores the data; Spark worker nodes are co-located with Cassandra and do the data processing. Spark is a batch-processing system, designed to deal with large amounts of data.
To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as input. This results in the RDD being materialized on each DStream batch interval.
Be warned that large tables or tables that continuously grow in size will negatively impact performance of your Streaming job.
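Here is a minimal sketch of that approach. Note that ConstantInputDStream is part of the Scala/Java streaming API and is not exposed in PySpark, so the sketch is in Scala; the keyspace2.users table is taken from the question, the connection host is an assumption, and the row count printed in foreachRDD is just a placeholder output operation.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object CassandraStreamingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Cassandra Streaming Test")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed Cassandra host
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // RDD backed by the Cassandra table; it is re-materialized on every batch interval.
    val cassandraRDD = sc.cassandraTable("keyspace2", "users")

    // Feed that RDD into the streaming job on each interval.
    val dstream = new ConstantInputDStream(ssc, cassandraRDD)

    // An output operation is required, otherwise the StreamingContext refuses to
    // start with "No output operations registered, so nothing to execute".
    dstream.foreachRDD { rdd =>
      println(s"Rows in keyspace2.users: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Submitted with something like --packages datastax:spark-cassandra-connector:1.4.1-s_2.10 (the exact package coordinates are an assumption; adjust them to your connector version).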
See also: Reading from Cassandra using Spark Streaming for an example.