 

Output DStream of Apache Spark in Python

I'm trying out the technologies I will be using to build a real-time data pipeline, and I have run into some issues exporting my contents to a file.

I have set up a local Kafka cluster and a Node.js producer that sends a simple text message, just to test functionality and get a rough estimate of the complexity of implementation.

This is the Spark Streaming job that reads from Kafka; I am trying to get it to write to a file.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create a local StreamingContext with two working threads and a batch interval of 10 seconds
sc = SparkContext("local[2]", "KafkaStreamingConsumer")
ssc = StreamingContext(sc, 10)

kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "consumer-group", {"test": 1})

kafkaStream.saveAsTextFile('out.txt')

# pprint() registers a console output action and returns None, so don't wrap it in print
kafkaStream.pprint()

ssc.start()
ssc.awaitTermination()

The error I am seeing when submitting the Spark job is:

kafkaStream.saveAsTextFile('out.txt')
AttributeError: 'TransformedDStream' object has no attribute 'saveAsTextFile'

No computations or transformations are performed on the data; I just want to build the flow. What do I need to change or add to be able to export the data to a file?

Stelios Savva asked Aug 12 '15 13:08


1 Answer

http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html

Use saveAsTextFiles (note the plural).

saveAsTextFile (singular) is a method on an RDD, not a DStream.
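A minimal sketch of the corrected job, assuming the same local Kafka/ZooKeeper setup as in the question (the `out` prefix and `txt` suffix are illustrative; Spark writes one output directory per batch, named `<prefix>-<batch time>.<suffix>`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "KafkaStreamingConsumer")
ssc = StreamingContext(sc, 10)

kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "consumer-group", {"test": 1})

# saveAsTextFiles is the DStream method: for each batch it creates a
# directory (e.g. out-1439384400000.txt) holding that batch's partitions
kafkaStream.saveAsTextFiles('out', 'txt')

kafkaStream.pprint()  # registers a console output action; returns None

ssc.start()
ssc.awaitTermination()
```

If you want to use the singular RDD method, `kafkaStream.foreachRDD(...)` hands you each batch as an RDD, on which `saveAsTextFile` is available.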

Cody Koeninger answered Sep 28 '22 07:09