Spark partitionBy much slower than without it

I tested writing with:

    import org.apache.spark.sql.SaveMode

    df.write.partitionBy("id", "name")
      .mode(SaveMode.Append)
      .parquet(filePath)

However, if I leave out the partitioning:

    df.write
      .mode(SaveMode.Append)
      .parquet(filePath)

It executes 100x(!) faster.

Is it normal for the same amount of data to take 100x longer to write when partitioning?

The id column has 10 unique values and the name column has 3000 unique values. The DataFrame also contains 10 additional integer columns.
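
For reference, here is a minimal sketch of a DataFrame with that shape (the id and name column names come from the question; the row count, the value generation, and the SparkSession named spark are assumptions):

    import org.apache.spark.sql.functions._

    // Assumed shape: 10 distinct ids, 3000 distinct names, 10 extra int columns.
    val base = spark.range(0, 1000000).toDF("row")
    val skeleton = base
      .withColumn("id", (col("row") % 10).cast("int"))
      .withColumn("name", concat(lit("name_"), (col("row") % 3000).cast("string")))
    // Add the ten additional integer columns, c0 through c9.
    val df = (0 until 10).foldLeft(skeleton) { (d, i) =>
      d.withColumn(s"c$i", (rand() * 100).cast("int"))
    }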

Asked Oct 01 '15 by BAR


1 Answer

The first code snippet writes a Parquet file per partition to the file system (local or HDFS). With 10 distinct ids and 3000 distinct names, that is up to 30,000 partition directories, one per distinct (id, name) pair, each containing at least one file. I suspect the overhead of creating that many files, writing Parquet metadata, and so on is quite large (in addition to shuffling).
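
If the partitioned layout is required, one common mitigation (a sketch, not part of the original answer, and assuming a Spark version whose repartition accepts column expressions) is to repartition on the partition columns before writing, so each (id, name) directory is produced by a single task:

    import org.apache.spark.sql.SaveMode

    // Hash-partition the data on the partition columns first; every distinct
    // (id, name) pair then lives in exactly one task, so each output
    // directory receives one file rather than one file per task.
    df.repartition(df("id"), df("name"))
      .write
      .partitionBy("id", "name")
      .mode(SaveMode.Append)
      .parquet(filePath)

This trades per-key write parallelism for far fewer files, which usually helps when per-file overhead dominates.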

Spark is not the best database engine; if your dataset fits in memory, I suggest using a relational database instead. It will be faster and easier to work with.

Answered Sep 17 '22 by kostya