Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python multiprocessing tool vs Py(Spark)

A newbie question, as I get increasingly confused with pyspark. I want to scale an existing python data preprocessing and data analysis pipeline. I realize if I partition my data with pyspark, I can't treat each partition as a standalone pandas data frame anymore, and need to learn to manipulate with pyspark.sql row/column functions, and change a lot of existing code, plus I am bound to spark mllib libraries and can't take full advantage of more mature scikit-learn package. Then why would I ever need to use Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on existing dataframe?

like image 702
JPiter Avatar asked Jun 14 '17 22:06

JPiter


People also ask

Is PySpark same as Python and Spark?

PySpark has been released in order to support the collaboration of Apache Spark and Python, it actually is a Python API for Spark. In addition, PySpark, helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and Python programming language.

When would you use a multiprocessing pool?

Use the multiprocessing pool if your tasks are independent. This means that each task is not dependent on other tasks that could execute at the same time. It also may mean tasks that are not dependent on any data other than data provided via function arguments to the task.

Does PySpark use Python?

PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.


1 Answers

True, Spark does have the limitations you have mentioned, that is you are bounded in the functional spark world (spark mllib, dataframes etc). However, what it provides vs other multiprocessing tools/libraries is the automatic distribution, partition and rescaling of parallel tasks. Scaling and scheduling spark code becomes an easier task than having to program your custom multiprocessing code to respond to larger amounts of data + computations.

like image 58
Grigoropoulos Stathis Avatar answered Sep 26 '22 01:09

Grigoropoulos Stathis