
Why is Apache Spark (Python) so slow locally compared to pandas?

A Spark newbie here. I recently started playing around with Spark on my local machine, on two cores, by using the command:

pyspark --master local[2] 

I have a 393 MB text file with almost a million rows. I wanted to perform some data manipulation operations. I am using the built-in DataFrame functions of PySpark to perform simple operations like groupBy, sum, max, and stddev.

However, when I do the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.
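For reference, the operations were along these lines (a minimal sketch; the file path and the column names category/value are made up):

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# PySpark: load the file and aggregate
sdf = spark.read.csv("data.txt", header=True, inferSchema=True)
sdf.groupBy("category").agg(
    F.sum("value").alias("sum"),
    F.max("value").alias("max"),
    F.stddev("value").alias("stddev"),
).show()

# pandas: the same aggregation, entirely in memory
pdf = pd.read_csv("data.txt")
print(pdf.groupby("category")["value"].agg(["sum", "max", "std"]))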

I was wondering what could be a possible reason for this. I have a couple of thoughts:

  1. Do the built-in functions perform serialization/deserialization inefficiently? If so, what are the alternatives?
  2. Is the dataset too small to amortize the overhead of the underlying JVM on which Spark runs?

Thanks for looking. Much appreciated.

asked Feb 15 '18 by Raj


People also ask

Is Spark slower than Pandas?

Due to parallel execution across all cores of multiple machines, PySpark runs operations faster than pandas, so a pandas DataFrame is often converted to a PySpark (Spark with Python) DataFrame for better performance.
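A minimal sketch of that conversion, assuming a running SparkSession (the frame contents are made up):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a small in-memory pandas frame
pdf = pd.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})

# convert it to a distributed Spark DataFrame
sdf = spark.createDataFrame(pdf)

# and back again, collecting all rows to the driver
pdf2 = sdf.toPandas()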

Why is Apache Spark slow?

Sometimes Spark runs slowly because too many concurrent tasks are running. High concurrency is normally a beneficial feature: it provides Spark-native fine-grained sharing, which maximizes resource utilization while cutting query latencies. But when the task count far exceeds the available cores, scheduling overhead dominates.
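One knob worth checking on a small local run: Spark defaults to 200 shuffle partitions, which creates far more tasks than two cores can use. A minimal sketch of lowering it (the value 2 is just a guess matching the core count):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# the default of 200 shuffle partitions mostly buys per-task
# scheduling overhead on 2 cores, so match it to the core count
spark.conf.set("spark.sql.shuffle.partitions", "2")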

What is the difference between a Spark and a Pandas DataFrame?

A Spark DataFrame is distributed, so processing a large amount of data in a Spark DataFrame is faster. A pandas DataFrame is not distributed, so processing a large amount of data in a pandas DataFrame is slower.

Which is better Python or Spark?

Scala is faster than Python because Scala is statically typed. If faster performance is a requirement, Scala is a good bet. Spark is native to Scala, which makes Scala the native way to write Spark jobs.


1 Answer

Because:

  • Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of those properties carries a significant cost.
  • Purely in-memory, in-core processing (pandas) is orders of magnitude faster than disk and network I/O, even local I/O (Spark).
  • Parallelism and distributed processing add significant overhead, and even an optimal, embarrassingly parallel workload does not guarantee any performance improvement.
  • Local mode is not designed for performance; it is meant for testing.
  • Last but not least, 2 cores running on 393 MB is not enough to see any performance improvement, and a single node provides no opportunity for distribution.
  • See also: Spark: Inconsistent performance number in scaling number of cores; Why is pyspark so much slower in finding the max of a column?; Why does my Spark run slower than pure Python? Performance comparison.

You can go on like this for a long time...
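A quick way to see the overhead yourself is a micro-benchmark like the sketch below (synthetic data; absolute numbers will vary by machine):

import time
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

pdf = pd.DataFrame({"category": ["a", "b"] * 500_000,
                    "value": range(1_000_000)})
sdf = spark.createDataFrame(pdf)

t0 = time.perf_counter()
pdf.groupby("category")["value"].sum()
print("pandas :", time.perf_counter() - t0)

t0 = time.perf_counter()
# collect() forces execution; Spark plans lazily, so timing the
# transformation alone would measure nothing
sdf.groupBy("category").agg(F.sum("value")).collect()
print("pyspark:", time.perf_counter() - t0)

The pandas timing covers only the computation, while the PySpark timing also covers task scheduling, serialization between the JVM and Python, and a shuffle, which is exactly where the gap comes from on a dataset this small.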

answered Sep 24 '22 by user9366962