I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as a streaming application.
While looking for an answer on the net, I could only find Kafka integration with Spark Streaming and nothing about integration with a batch job.
Does anyone know if such a thing is feasible?
Thanks
UPDATE:
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used a spark-shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried:
import org.apache.spark.sql.functions._

val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column): _*)).alias("value"))
newdf.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "alerts").save()
But I get the error:
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Any idea what this is related to?
Thanks
tl;dr You are using an outdated Spark version. Batch writes to Kafka are enabled in Spark 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same one used with Structured Streaming). Include spark-sql-kafka in your dependencies. Create a DataFrame containing at least a value column of type StringType or BinaryType, and write the data to Kafka:
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", server)
  .save()
Follow the Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
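For completeness, here is a minimal sketch of the asker's example run against Spark 2.2 or later in a spark-shell; the connector version, broker address, and topic name are placeholders to adjust for your environment:

// Assumes a shell started with a matching connector, e.g.:
//   spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
// (artifact version is an assumption; match it to your Spark and Scala build)
import org.apache.spark.sql.functions.{col, struct, to_json}

val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")

// Pack every column into a JSON string under a column named "value",
// which is what the Kafka sink expects.
val out = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

out.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("topic", "alerts")                            // placeholder topic
  .save()

Note that the target topic can be given either through the topic option, as above, or through a topic column in the DataFrame itself.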