Why does my Spark run slower than pure Python? Performance comparison

Tags: performance, python, apache-spark, apache-spark-sql, pyspark

Spark newbie here. I tried running some pandas-style operations on my DataFrame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package in Python). Here's what I did:

1) In Spark:

train_df.filter(train_df.gender == '-unknown-').count()

It takes about 30 seconds to get results back. But using Python it takes about 1 second.

2) In Spark:

sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

Same thing, takes about 30 sec in Spark, 1 sec in Python.

Several possible reasons my Spark is much slower than pure Python:

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)

Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!

536

asked Jan 06 '16 04:01

Vicky Zhang

1 Answer

Pure Python (pandas) will almost always perform better than PySpark on small data sets. You will see Spark's advantage only when you are dealing with larger data sets.

When you run Spark through a SQLContext or HiveContext, shuffles (such as the GROUP BY in your second query) use 200 partitions by default, which is far more than a data set this small needs. Lower it to 10, or whatever value suits your data, with sqlContext.sql("set spark.sql.shuffle.partitions=10"). Queries that shuffle should then run noticeably faster than with the default.
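As a minimal sketch of that setting in use: the helper names `shuffle_partitions_statement` and `count_by_gender` below are hypothetical, and the code assumes you already have a `sqlContext` with the `train` table registered, as in the question. It only wraps the same `SET` statement and the same GROUP BY query; 10 is a starting point to tune, not a recommended value.

```python
def shuffle_partitions_statement(num_partitions):
    """Build the SQL statement that sets spark.sql.shuffle.partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return "SET spark.sql.shuffle.partitions={}".format(num_partitions)


def count_by_gender(sql_context, num_partitions=10):
    """Lower the shuffle partition count, then run the GROUP BY from the question.

    `sql_context` is assumed to be an existing SQLContext or HiveContext
    in which a table named `train` has already been registered.
    """
    # Apply the setting first so the aggregation below shuffles into
    # `num_partitions` partitions instead of the default 200.
    sql_context.sql(shuffle_partitions_statement(num_partitions))
    return sql_context.sql("SELECT gender, count(*) FROM train GROUP BY gender")
```

In a PySpark shell this would be called as `count_by_gender(sqlContext).show()`.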

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

You are right: at this volume you will not see any benefit from Spark, and it can well be slower.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

For your data volume, moving to a cluster is unlikely to help much.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

Again, that does not matter for a data set of this size (about 24 MB).

4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)

On a standalone setup there will be a difference. Python has more runtime overhead than Scala, but on a larger cluster with distributed processing it need not matter.
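If you want to check these points on your own machine, it helps to time both versions the same way and more than once, since a first Spark run often includes one-off start-up costs. Below is a rough sketch using only the standard library; `time_call` is a hypothetical helper, `pandas_df` is an assumed pandas copy of the same data, and Python 3 is assumed for `time.perf_counter`.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each run's wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical usage, assuming `train_df` (Spark) and `pandas_df` (pandas)
# hold the same rows:
#   spark_times = time_call(lambda: train_df.filter(train_df.gender == '-unknown-').count())
#   pandas_times = time_call(lambda: (pandas_df.gender == '-unknown-').sum())
```

Comparing the later runs rather than only the first gives a fairer picture of the per-query cost on each side.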

196

answered Oct 21 '22 10:10

Durga Viswanath Gadiraju
