The following simple Spark program takes 4 minutes to run, and I don't know what is wrong with the code.
First, I generate a VERY small RDD:
D = spark.sparkContext.parallelize([(0,[1,2,3]),(1,[2,3]),(2,[0,3]),(3,[1])]).cache()
Then I generate a vector:
P1 = spark.sparkContext.parallelize(list(zip(list(range(4)),[1/4]*4))).cache()
Then I define a function to do the map step:
def MyFun(x):
    L0 = len(x[2])
    L = []
    for i in x[2]:
        L.append((i, x[1]/L0))
    return L
Then I execute the following code:
P0 = P1
D0 = D.join(P1).map(lambda x: [x[0],x[1][1],x[1][0]]).cache()
C0 = D0.flatMap(lambda x: MyFun(x)).cache()
P1 = C0.reduceByKey(lambda x,y:x+y).mapValues(lambda x:x*1.2+3.4).sortByKey().cache()
Diff = P1.join(P0).map(lambda x: abs(x[1][0]-x[1][1])).sum()
Given that my data is so small, I can't figure out why this piece of code runs so slowly...
I have a few suggestions to help you speed up this job.
Caching materializes an RDD and stores its partitions (in memory by default, possibly spilling to disk depending on the storage level), and that materialization itself costs time and memory. Caching every step can therefore slow the job down instead of speeding it up; it only pays off for data that is actually reused.
I would suggest caching only P1: it is the only RDD that is used more than once (in the Diff computation and, presumably, in the next iteration), while D0 and C0 are each consumed a single time.
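For illustration, here is a minimal sketch of the same step with only P1 cached (the final unpersist is my addition, assuming this block runs inside an iteration loop):

P0 = P1
D0 = D.join(P1).map(lambda x: [x[0], x[1][1], x[1][0]])   # consumed once, so no cache
C0 = D0.flatMap(MyFun)                                     # consumed once, so no cache
P1 = (C0.reduceByKey(lambda x, y: x + y)
        .mapValues(lambda x: x * 1.2 + 3.4)
        .sortByKey()
        .cache())                                          # reused below and in the next iteration
Diff = P1.join(P0).map(lambda x: abs(x[1][0] - x[1][1])).sum()
P0.unpersist()                                             # optionally drop the previous iteration's cached result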
Next, I strongly suggest using the DataFrame API: Spark can then apply query optimizations for you, such as predicate pushdown.
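As a rough sketch (the column names src, dst_list, node and rank are illustrative, not taken from your code), the two inputs could be expressed as DataFrames and joined like this:

edges = D.toDF(["src", "dst_list"])    # (node, list of outgoing nodes)
ranks = P1.toDF(["node", "rank"])      # (node, current value)

joined = edges.join(ranks, edges.src == ranks.node)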
Last but not least, custom functions are expensive as well, because every row has to be passed through the Python interpreter instead of being handled by Spark's optimized execution engine. If you use DataFrames, try to stick to the built-in functions (pyspark.sql.functions in Python, org.apache.spark.sql.functions in Scala).
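For example, MyFun just spreads each row's value evenly over its list. Continuing from the joined DataFrame sketched above, explode and size from pyspark.sql.functions express the same thing without calling any Python code per row (column names are again illustrative):

from pyspark.sql import functions as F

contribs = (joined
            .withColumn("share", F.col("rank") / F.size("dst_list"))   # value / len(list), as in MyFun
            .withColumn("dst", F.explode("dst_list"))                   # one output row per target, like the flatMap
            .groupBy("dst")
            .agg((F.sum("share") * 1.2 + 3.4).alias("rank")))           # replaces reduceByKey + mapValues

The final Diff can then be computed with one more join plus F.abs and F.sum in the same style.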
Finally, I suggest profiling the job through the Spark UI: with data this small, the problem may not be your code at all but something on the cluster side, such as a slow or overloaded node.