I'm trying out the technologies that I will be using to build a real-time data pipeline, and I have run into some issues exporting the contents to a file.
I have set up a local Kafka cluster and a Node.js producer that sends a simple text message, just to test functionality and get a rough estimate of the implementation's complexity.
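For reference, the producer does nothing more than push a single text message to the test topic. A roughly equivalent sketch in Python (using the kafka-python package and assuming the broker listens on the default localhost:9092) looks like this:

from kafka import KafkaProducer

# Equivalent of the Node.js test producer: send one plain-text message
# to the "test" topic on a local broker (localhost:9092 is assumed here).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test", b"hello kafka")
producer.flush()  # make sure the message is delivered before exiting
producer.close()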
This is the Spark Streaming job that reads from Kafka; I am trying to get it to write to a file.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# Create a local StreamingContext with two worker threads and a batch interval of 10 seconds
sc = SparkContext("local[2]", "KafkaStreamingConsumer")
ssc = StreamingContext(sc, 10)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "consumer-group", {"test": 1})
kafkaStream.saveAsTextFile('out.txt')
kafkaStream.pprint()  # print the events received in each window
ssc.start()
ssc.awaitTermination()
The error I am seeing when submitting the Spark job is:
kafkaStream.saveAsTextFile('out.txt')
AttributeError: 'TransformedDStream' object has no attribute 'saveAsTextFile'
No computations or transformations are performed on the data; I just want to build the flow first. What do I need to change or add to be able to export the data to a file?
Per the PySpark Streaming API docs (http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html), the DStream method is saveAsTextFiles (note the plural). saveAsTextFile (singular) is a method on an RDD, not a DStream.
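As a minimal sketch of the corrected job (same local setup as in the question; the output prefix "out" is an arbitrary choice here), switching to saveAsTextFiles is the only required change. Note that Spark writes each batch to its own directory of part files named out-<timestamp>, rather than to a single out.txt:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "KafkaStreamingConsumer")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "consumer-group", {"test": 1})

# DStream.saveAsTextFiles(prefix, suffix=None) saves each batch's RDD as
# text files in a directory named <prefix>-<time in ms>[.<suffix>].
kafkaStream.saveAsTextFiles("out")
kafkaStream.pprint()  # also print each batch to the console

ssc.start()
ssc.awaitTermination()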