
Direct Kafka Stream with PySpark (Apache Spark 1.6)

I'm trying to leverage the direct Kafka consumer (a feature newly available in Python) to capture data from a custom Kafka producer that I'm running on localhost:9092.

I'm currently using "direct_kafka_wordcount.py" as provided with the Spark 1.6 example scripts.

Source: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/direct_kafka_wordcount.py

DOCS: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
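
For reference, this is roughly what the example script does (a trimmed sketch; the real script reads the broker list and topic from sys.argv, here they are hard-coded to the values from my command):

    # Trimmed sketch of direct_kafka_wordcount.py with broker/topic hard-coded
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)  # 2-second batches

    # Direct approach: no receiver, offsets are managed by the stream itself
    kvs = KafkaUtils.createDirectStream(
        ssc, ["twitter.live"], {"metadata.broker.list": "localhost:9092"})

    lines = kvs.map(lambda x: x[1])            # keep only the message value
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()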

I'm using the following command to run the program:

    ~/spark-1.6.0/bin/spark-submit --jars \
      ~/spark-1.6.0/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.6.0.jar \
      direct_kafka_wordcount.py localhost:9092 twitter.live

Unfortunately, I'm getting a strange error, which I'm not able to debug. Any tips/suggestions will be immensely appreciated.

py4j.protocol.Py4JJavaError: An error occurred while calling o24.createDirectStreamWithoutMessageHandler.
: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
        at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
        at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
        at scala.util.Either.fold(Either.scala:97)
        at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
        at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
        at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createDirectStream(KafkaUtils.scala:720)
        at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createDirectStreamWithoutMessageHandler(KafkaUtils.scala:688)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
asked Feb 27 '16 by cynical biscuit


People also ask

How do you integrate Kafka with PySpark?

We first create a SparkSession, which provides a single point of entry for interacting with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. To read from Kafka for streaming queries, we can use spark.readStream, as sketched below.
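
A minimal sketch using the newer Structured Streaming API (Spark 2.x+, not the Spark 1.6 DStream API from the question); it assumes the spark-sql-kafka package is on the classpath, and reuses the broker and topic names from the question:

    # Minimal Structured Streaming sketch (Spark 2.x+); broker/topic taken from the question
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaStructuredRead").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "twitter.live")
          .load())

    # Kafka delivers key/value as binary; cast the value to a string to work with it
    lines = df.selectExpr("CAST(value AS STRING)")

    query = lines.writeStream.format("console").start()
    query.awaitTermination()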

How does Spark streaming work with Kafka?

The Spark Streaming and Kafka integration provides parallelism between Kafka partitions and Spark partitions, along with access to Kafka metadata and offsets. A direct stream can also be created for an input stream to pull messages directly from Kafka.

Can Spark be used with Kafka?

Kafka can serve as the messaging and integration platform for Spark Streaming: it acts as the central hub for real-time streams of data, which are then processed using complex algorithms in Spark Streaming.

What is the primary difference between Kafka streams and Spark streaming?

Kafka analyses the events as they unfold. As a result, it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch processing approach, which divides incoming streams into small batches for processing.



2 Answers

The error:

java.nio.channels.ClosedChannelException

means that the topic does not exist, the brokers are not reachable, or there is some network (proxy) issue.

Make sure there is no such connectivity issue by running kafka-console-consumer on the Spark master and worker nodes, for example as shown below.
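
For example (paths and flags depend on your Kafka version; 0.8/0.9-era console consumers use --zookeeper, newer ones use --bootstrap-server):

    # Run from the Kafka installation directory on each node; adjust paths/flags to your setup
    bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic twitter.live --from-beginning
    # On newer Kafka versions:
    # bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter.live --from-beginning

If messages show up there but not in Spark, the problem is likely on the Spark side; if they don't, fix the broker or topic first.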

answered Nov 11 '22 by Mohitt


I had a similar problem, but it turned out to have a different solution: I was running different Scala versions for Spark and the Kafka assembly jar.

Once I used the same Scala version on both sides, PySpark was able to generate the classes.

I used the following:

Spark: spark-1.6.3-bin-hadoop2.6.tgz
spark-streaming-kafka: spark-streaming-kafka-assembly_2.10-1.6.3.jar
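
With those in place, the submit command looks like the one in the question, just with matching 1.6.3 versions (the paths here are assumed; adjust them to wherever your Spark install and assembly jar actually live):

    ~/spark-1.6.3/bin/spark-submit --jars \
      spark-streaming-kafka-assembly_2.10-1.6.3.jar \
      direct_kafka_wordcount.py localhost:9092 twitter.live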

answered Nov 11 '22 by chandank