 

ImportError: No module named 'kafka' in databricks pyspark

I am not able to use the kafka library in a Databricks notebook; I get the error:

ImportError: No module named 'kafka'

from kafka import KafkaProducer

def send_to_kafka(rows):
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for row in rows:
        # KafkaProducer expects bytes for the value unless a value_serializer is configured
        producer.send('topic', str(row.asDict()).encode('utf-8'))
    # flush once per partition, not once per message
    producer.flush()

df.foreachPartition(send_to_kafka)

Databricks cluster info:

Spark version: 2.4.3

Scala version: 2.11

Help me. Thanks in advance.

Ajay Kumar asked Dec 22 '25 15:12

1 Answer

Don't do this — it's very inefficient. Instead, use the built-in Kafka connector to write the data, like this (you first need to convert each row into a JSON string):

from pyspark.sql.functions import to_json, struct
df.select(to_json(struct("*")).alias("value"))\
  .write.format("kafka")\
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
  .option("topic", "topic1")\
  .save()
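For reference, the `value` that ends up in each Kafka message is just the row serialized as a JSON string — that is what `to_json(struct("*"))` produces. A minimal pure-Python sketch of that conversion, with made-up column names and no Spark required:

```python
import json

# Hypothetical rows, standing in for df rows converted to dicts
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

# to_json(struct("*")) effectively packs every column of a row
# into one JSON string, which becomes the Kafka message value.
values = [json.dumps(row) for row in rows]
print(values[0])  # {"id": 1, "name": "alice"}
```

On the Spark side, the Kafka sink only requires a `value` column (plus an optional `key`), which is why the answer's `select(...).alias("value")` step is needed.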
Alex Ott answered Dec 24 '25 04:12

