
Spark save(write) parquet only one file

If I write

dataFrame.write.format("parquet").mode("append").save("temp.parquet")

then in the temp.parquet folder I get the same number of files as the number of rows.

I don't think I fully understand Parquet yet, but is this normal?

Asked Aug 01 '18 by Easyhyum


People also ask

How do I save a Parquet file in Spark?

Spark can append a DataFrame to existing Parquet files using the "append" save mode. If you want to replace the existing data instead, use the "overwrite" save mode.

Can Parquet file be split?

Parquet is a splittable format, but a single Parquet file should not be split across multiple HDFS blocks; when it is, readers can run into problems such as strange record counts.

Is writing to parquet faster than CSV?

We showed how by storing large data files in Parquet format (instead of traditional CSV) and using PyArrow utility methods, we can achieve faster processing time, especially in situations where the file read operation takes significantly more time than the actual data processing.

Why do we partition parquet files?

An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values.


1 Answer

Use coalesce before the write operation:

dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")


EDIT-1

Upon a closer look, the docs do warn about coalesce

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)

Therefore, as suggested by @Amar, it's better to use repartition.

Answered Sep 17 '22 by y2k-shubham