I have an Apache Spark cluster and a RabbitMQ broker, and I want to consume messages and compute some metrics using the pyspark.streaming module.
The problem is that the only package I found is implemented in Java and Scala. Besides that, I didn't find any example or bridge implementation in Python.
I have a consumer implemented using Pika, but I don't know how to pass the payload to my StreamingContext.
This solution uses the Consumer class from the pika asynchronous consumer example (a .py file) together with the socketTextStream method from Spark Streaming.

Under if __name__ == '__main__': we need to open a socket with the HOST and PORT corresponding to your TCP connection to Spark Streaming. We must save the sendall method of the accepted connection into a variable and pass it to the Consumer class:
import socket

HOST, PORT = 'localhost', 9999  # example values; Spark must connect to this same address

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind((HOST, PORT))
    s.listen(1)
    conn, addr = s.accept()
    dispatcher = conn.sendall  # assigning sendall to the dispatcher variable
    consumer = Consumer(dispatcher)
    try:
        consumer.run()
    except Exception:
        consumer.stop()
        s.close()
Modify the __init__ method in Consumer to receive the dispatcher. The amqp_url parameter keeps the broker URL from the pika example (adjust it for your broker):

def __init__(self, dispatcher, amqp_url='amqp://guest:guest@localhost:5672/%2F'):
    self._connection = None
    self._channel = None
    self._closing = False
    self._consumer_tag = None
    self._url = amqp_url
    # new code: store the socket's sendall so on_message can forward payloads
    self._dispatcher = dispatcher
In the on_message method inside Consumer we call self._dispatcher to send the body of the AMQP message over the socket:

def on_message(self, unused_channel, basic_deliver, properties, body):
    self._channel.basic_ack(basic_deliver.delivery_tag)
    # Spark's socketTextStream splits records on '\n', so terminate each message with a newline
    self._dispatcher(bytes(body.decode("utf-8") + '\n', "utf-8"))
In the Spark application, call ssc.socketTextStream(HOST, int(PORT)) with the HOST and PORT corresponding to our TCP socket. Spark will manage the connection.
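For reference, here is a minimal sketch of the Spark side. The address and the per-batch message count are assumptions for illustration; substitute your own host, port, and metric logic:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

HOST, PORT = 'localhost', 9999  # assumed: must match the socket opened by the consumer

sc = SparkContext(appName="RabbitMQMetrics")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream(HOST, int(PORT))
lines.count().pprint()  # example metric: number of messages per batch

ssc.start()
ssc.awaitTermination()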
Run the consumer first, then the Spark application; the consumer must already be listening when Spark tries to connect.
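For example, assuming the consumer lives in consumer.py and the Spark job in streaming_app.py (hypothetical file names, not from the original answer):

python consumer.py              # opens the socket and blocks in s.accept()
spark-submit streaming_app.py   # connects to the socket and starts the stream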