Python multiprocessing tool vs Py(Spark)

A newbie question, as I am getting increasingly confused by PySpark. I want to scale an existing Python data preprocessing and data analysis pipeline. I realize that if I partition my data with PySpark, I can no longer treat each partition as a standalone pandas data frame. Instead, I need to learn to manipulate the data with pyspark.sql row/column functions and change a lot of existing code. On top of that, I am bound to the Spark MLlib libraries and can't take full advantage of the more mature scikit-learn package. So why would I ever need Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on the existing data frame?

asked Jun 14 '17 22:06

JPiter

1 Answer

True, Spark does have the limitations you mention: you are bound to Spark's functional world (Spark MLlib, DataFrames, etc.). However, what it provides over other multiprocessing tools and libraries is automatic distribution, partitioning, and rescaling of parallel tasks. Scaling and scheduling Spark code is easier than writing custom multiprocessing code that has to keep up with growing amounts of data and computation.
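To make the trade-off concrete, here is a minimal sketch of the hand-rolled approach, using only the standard library. Plain lists of dicts stand in for pandas data frames, and the names `split_into_chunks`, `preprocess_chunk`, and `run_parallel`, as well as the `x` / `x_squared` columns, are hypothetical and chosen for illustration. Note that `multiprocessing.Pool` spreads work over the cores of one machine only; partitioning, worker count, and recombining results are all decided by your own code, which is the part Spark would otherwise handle for you.

```python
from multiprocessing import Pool


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, roughly equal slices."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional row.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when rows are fewer than chunks
            chunks.append(rows[start:end])
        start = end
    return chunks


def preprocess_chunk(chunk):
    """Hypothetical per-partition step: add a squared feature to each row."""
    return [{**row, "x_squared": row["x"] ** 2} for row in chunk]


def run_parallel(rows, n_workers=4):
    """Process chunks in separate local processes and stitch the results together."""
    chunks = split_into_chunks(rows, n_workers)
    with Pool(processes=n_workers) as pool:
        # pool.map preserves chunk order, so the output order matches the input.
        processed = pool.map(preprocess_chunk, chunks)
    return [row for chunk in processed for row in chunk]
```

In a script, call `run_parallel(rows)` from under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on platforms that start them with spawn. Each chunk is handled by ordinary Python code, so existing per-partition logic can be reused, but nothing here moves data between machines or retries a failed worker.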

answered Sep 26 '22 01:09

Grigoropoulos Stathis


Tags:

python

multiprocessing

scikit-learn

pyspark

cluster-computing
