How to manage an Apache Spark context in Django?

I have a Django application that interacts with a Cassandra database and I want to try using Apache Spark to run operations on this database. I have some experience with Django and Cassandra but I'm new to Apache Spark.

I know that to interact with a Spark cluster I first need to create a SparkContext, something like this:

from pyspark import SparkContext, SparkConf

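# appName and master are placeholders for the application name and the cluster's master URL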
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

My question is the following: how should I treat this context? Should I instantiate it when my application starts and let it live for the whole execution, or should I start a SparkContext every time before running an operation on the cluster and then stop it when the operation finishes?

Thank you in advance.

Pedro Bernardo asked Sep 07 '16

1 Answer

I've been working on this for the last few days; since no one has answered, I'll post the approach I took.

Apparently, creating a SparkContext involves a fair amount of overhead, so stopping and recreating the context after every operation is not a good idea.

Also, there appears to be no downside to letting the context live for as long as the application runs.

Therefore, my approach was to treat the SparkContext like a database connection: I created a singleton that instantiates the context when the application starts and used it wherever needed, as in the sketch below.
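For illustration, here is a minimal sketch of that singleton; the module name, app name, and master URL are placeholders I made up, not the actual values from my project:

# spark_singleton.py -- shared SparkContext for the whole Django process
from pyspark import SparkContext, SparkConf

_sc = None

def get_spark_context():
    """Create the SparkContext on first use and reuse it afterwards."""
    global _sc
    if _sc is None:
        conf = SparkConf().setAppName("my-django-app").setMaster("spark://master:7077")
        _sc = SparkContext(conf=conf)
    return _sc

Any view or management command then calls get_spark_context() instead of building its own context, and the context is only stopped when the process shuts down.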

I hope this is helpful to someone. I'm open to other suggestions on how to deal with this, since I'm still new to Apache Spark.

Pedro Bernardo answered Oct 21 '22