A newbie question, as I get increasingly confused with pyspark. I want to scale an existing python data preprocessing and data analysis pipeline. I realize if I partition my data with pyspark, I can't treat each partition as a standalone pandas data frame anymore, and need to learn to manipulate with pyspark.sql row/column functions, and change a lot of existing code, plus I am bound to spark mllib libraries and can't take full advantage of more mature scikit-learn package. Then why would I ever need to use Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on existing dataframe?
PySpark has been released in order to support the collaboration of Apache Spark and Python, it actually is a Python API for Spark. In addition, PySpark, helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and Python programming language.
Use the multiprocessing pool if your tasks are independent. This means that each task is not dependent on other tasks that could execute at the same time. It also may mean tasks that are not dependent on any data other than data provided via function arguments to the task.
PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.
True, Spark does have the limitations you have mentioned, that is you are bounded in the functional spark world (spark mllib, dataframes etc). However, what it provides vs other multiprocessing tools/libraries is the automatic distribution, partition and rescaling of parallel tasks. Scaling and scheduling spark code becomes an easier task than having to program your custom multiprocessing code to respond to larger amounts of data + computations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With