A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:
pyspark --master local[2]
I have a 393 MB text file which has almost a million rows. I wanted to do some data manipulation, so I am using the built-in DataFrame functions of PySpark to perform simple operations like groupBy, sum, max, and stddev.
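For reference, my aggregation looks roughly like the sketch below (the file name data.txt and the column names group and value are placeholders, not my real schema):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Same two local cores as: pyspark --master local[2]
spark = SparkSession.builder.master("local[2]").getOrCreate()

# Placeholder layout: a categorical "group" column and a numeric "value" column.
df = spark.read.csv("data.txt", header=True, inferSchema=True)

agg = (
    df.groupBy("group")
      .agg(
          F.sum("value").alias("total"),
          F.max("value").alias("max_value"),
          F.stddev("value").alias("std_value"),
      )
)
agg.show()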
However, when I perform the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.
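The equivalent pandas version, again with the same placeholder names, is roughly:

import pandas as pd

# Same placeholder "group" and "value" columns as in the PySpark snippet.
pdf = pd.read_csv("data.txt")

agg = pdf.groupby("group")["value"].agg(["sum", "max", "std"])
print(agg)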
I was wondering what could be a possible reason for this. I have a couple of thoughts.
Thanks for looking. Much appreciated.
Due to parallel execution across all cores (and, on a cluster, across multiple machines), PySpark can run operations faster than pandas on large datasets, which is why a pandas DataFrame is often converted to a PySpark (Spark with Python) DataFrame for better performance.
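A minimal sketch of that conversion, assuming a local SparkSession and a toy pandas DataFrame:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Toy pandas DataFrame used only for illustration.
pdf = pd.DataFrame({"group": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# pandas -> Spark: the data becomes a distributed DataFrame.
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: collects everything to the driver, so only do this
# when the result is small enough to fit in local memory.
back = sdf.toPandas()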
Sometimes Spark runs slowly simply because it schedules too many concurrent tasks for the amount of data. High concurrency is normally a strength: Spark's fine-grained sharing of tasks keeps resource utilization high and cuts query latencies on big workloads, but on a small dataset the scheduling overhead can outweigh the computation itself.
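One setting worth checking when a small job spawns too many tasks is the shuffle-partition count; the sketch below assumes a local two-core session, and the value 4 is only illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Spark defaults to 200 shuffle partitions, i.e. 200 tasks per aggregation,
# which is far more than two local cores need for a 393 MB file.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower it so each groupBy does not schedule hundreds of tiny tasks.
spark.conf.set("spark.sql.shuffle.partitions", "4")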
A Spark DataFrame is distributed, so processing large amounts of data is faster; a pandas DataFrame is not distributed, so processing slows down once the data gets large.
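You can see that distribution directly by inspecting partitions; the sketch below uses a stand-in DataFrame built with spark.range rather than a real file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Stand-in DataFrame; a real one would come from spark.read.csv(...) etc.
df = spark.range(1_000_000)

# Each partition can be processed by a different core (or executor on a cluster).
print(df.rdd.getNumPartitions())

# Repartitioning changes how the rows are spread across those workers.
print(df.repartition(4).rdd.getNumPartitions())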
As for speed of execution, Scala is faster than Python because it is a statically typed, compiled language; if raw performance is a requirement, Scala is a good bet. Spark itself is written in Scala, which makes writing Spark jobs in Scala the native way.
Because Spark pays overheads that pandas never does: starting a JVM, building and optimizing a distributed query plan, scheduling tasks, and moving data between the Python and JVM processes. On a 393 MB file that fits comfortably in memory, this overhead dominates the actual computation, and you can go on listing overheads like this for a long time...