Spark on localhost

Tags:

apache-spark

pyspark

For testing purposes, while I don't have a production cluster, I am using Spark locally:

from pyspark import SparkConf, SparkContext

print('Setting SparkContext...')
sconf = SparkConf()
sconf.setAppName('myLocalApp')
sconf.setMaster('local[*]')  # run locally, using all available cores
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')

Also, I am using a very small dataset, consisting of only 20 rows in a PostgreSQL database (~2 KB).

Also(!), my code is quite simple as well: it only groups the 20 rows by a key and applies a trivial map operation.

params = [object1, object2]
rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
            .groupByKey() \
            .mapValues(lambda value: self.__data_interpolation(value, params))

def __data_interpolation(self, data, params):
    # TODO: only for testing
    return data

What bothers me is that the whole execution takes about 5 minutes!!

Inspecting the Spark UI, I see that most of the time was spent in Stage 6 (the keyBy method). Stage 7 (the collect() method) was also slow.

Some info:

[Image: Spark UI screenshot]

These numbers make no sense to me... Why do I need 22 tasks, executing for 54 sec, to process less than 1 KB of data?

Can it be a network issue, trying to figure out the IP address of localhost? I don't know... Any clues?

asked Nov 03 '16 20:11

2 Answers

The main reason for the slow performance in your code snippet appears to be the use of groupByKey(). The problem with groupByKey is that it shuffles all of the key-value pairs, so a lot of data is transferred unnecessarily. A good reference that explains this issue is Avoid GroupByKey.

To work around this issue, you can:

Try using reduceByKey, which should be faster (more info is in the Avoid GroupByKey link above).
Use DataFrames (instead of RDDs), as DataFrames include performance optimizations (and the DataFrame GroupBy statement is faster than the RDD version). Also, since you're using Python, this avoids the Python-to-JVM issues with PySpark RDDs. More information on this can be found in PySpark Internals.
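The two workarounds can be sketched roughly as follows. This is only an illustration, not the asker's actual logic: it assumes the per-group work can be expressed as a pairwise combination (a sum here), and `value` is a hypothetical numeric column that does not appear in the question. The key columns `a`, `b`, `c` are taken from the question.

```python
from operator import add


def to_pair(row):
    """Key a row by (a, b, c), as in the question, with a numeric payload.

    `value` is a hypothetical column name used only for illustration.
    """
    return ((row["a"], row["b"], row["c"]), row["value"])


def sum_by_key_rdd(rdd):
    # reduceByKey combines values inside each partition before the shuffle,
    # whereas groupByKey sends every (key, value) pair through the shuffle.
    return rdd.map(to_pair).reduceByKey(add)


def sum_by_key_df(df):
    # DataFrame version: the aggregation is planned and executed in the JVM,
    # so rows are not handed to Python worker processes.
    return df.groupBy("a", "b", "c").sum("value")
```

With the `df` from the question, this would be called as `sum_by_key_rdd(df.rdd).collect()` or `sum_by_key_df(df).collect()`. If the per-group function really needs all values of a group at once, reduceByKey may not apply and the DataFrame route is the one to try first.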

By the way, looking at the Spark UI diagram above, the 22 refers to the task number within the DAG (not the number of tasks executed).

HTH!

answered Oct 23 '22 06:10

Denny Lee

I suspect that PostgreSQL is the key to solving this puzzle.

keyBy is probably the first operation that really uses the data, so its execution time is longer because it needs to fetch the data from the external database. You can verify this by adding the following at the beginning:

df.cache()
df.count() # to fill the cache
df.rdd.keyBy....
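To put numbers on this check, each step can be timed separately. The sketch below is one possible way to do it; `timed` and `profile_load` are hypothetical helper names, and `df` is assumed to be the DataFrame from the question.

```python
import time


def timed(label, action):
    """Run a zero-argument callable, print its wall-clock time, return its result."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result


def profile_load(df):
    """Separate the database read from the Spark work on the cached rows."""
    df.cache()
    # The first action has to pull the rows from the external database.
    timed("first count (reads from the database)", df.count)
    # The second action should be served from the cache.
    timed("second count (served from cache)", df.count)
    keyed = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)).groupByKey()
    return timed("keyBy + groupByKey + collect", keyed.collect)
```

If the first count dominates and the remaining steps are fast, the time is going into the database read rather than into the grouping.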

If I am right, you need to optimize the database. It may be:

A network issue (slow network to the DB server)
Complicated (and slow) SQL on this database (try running it in the psql shell)
Some authorization difficulties on the DB server
A problem with the JDBC driver you use

answered Oct 23 '22 07:10

Mariusz

guilhermecgs
