I have a Django application that interacts with a Cassandra database and I want to try using Apache Spark to run operations on this database. I have some experience with Django and Cassandra but I'm new to Apache Spark.
I know that to interact with a Spark cluster first I need to create a SparkContext, something like this:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
My question is the following: how should I treat this context? Should I instantiate it when my application starts and let it live for the duration of its execution, or should I create a SparkContext every time before running an operation on the cluster and then stop it when the operation finishes?
Thank you in advance.
I've been working on this for the last few days; since no one answered, I will post the approach I ended up with.
Apparently, creating a SparkContext incurs noticeable overhead, so stopping and recreating the context around every operation is not a good idea. Conversely, there seems to be no downside to letting a single context live for as long as the application runs.
Therefore, my approach was to treat the SparkContext like a database connection: I created a singleton that instantiates the context when the application starts and reused it wherever it was needed.
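As a rough sketch of that singleton idea (the class name, app name, and master URL below are illustrative placeholders, not taken from my actual code):

```python
class SparkContextHolder:
    """Process-wide singleton wrapper around a SparkContext.

    The context is created lazily on first use and then reused for
    every subsequent operation, mirroring how a long-lived database
    connection is typically handled.
    """
    _sc = None

    @classmethod
    def get(cls):
        if cls._sc is None:
            # Import lazily so the Django app can still start on
            # machines where pyspark is not installed.
            from pyspark import SparkContext, SparkConf
            conf = (SparkConf()
                    .setAppName("my-django-app")          # placeholder name
                    .setMaster("spark://master:7077"))    # placeholder master URL
            cls._sc = SparkContext(conf=conf)
        return cls._sc
```

Anywhere in the application, `SparkContextHolder.get()` then returns the same context, so the creation overhead is paid only once. One caveat: a SparkContext is tied to the process that created it, so with a multi-process deployment (e.g. several Django workers) each worker ends up with its own context.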
I hope this is helpful to someone. I'm still new to Apache Spark, so I'm open to suggestions on better ways to handle this.