If I write
dataFrame.write.format("parquet").mode("append").save("temp.parquet")
then in the temp.parquet folder I get the same number of files as the number of rows.
I don't think I fully understand Parquet yet, but is this behaviour normal?
Spark can append a DataFrame to existing Parquet files using the "append" save mode. If you want to replace the existing data instead, use the "overwrite" save mode.
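For instance, a minimal sketch of the overwrite variant (assuming dataFrame is an existing DataFrame, as in the question):

// Replaces the contents of temp.parquet instead of adding new files to it.
dataFrame.write.format("parquet").mode("overwrite").save("temp.parquet")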
Storing large data files in Parquet format (instead of traditional CSV) and using PyArrow utility methods can give faster processing times, especially in situations where reading the file takes significantly longer than the actual data processing.
An ORC or Parquet file contains data columns, and you can add partition columns to these files at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values.
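As a sketch, partitioning at write time looks like this (the country column is an illustrative assumption, not from the question):

// Writes one subdirectory per distinct value, e.g. temp.parquet/country=US/.
// The country values live in the directory names, not in the data files.
dataFrame.write.format("parquet").partitionBy("country").mode("append").save("temp.parquet")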
Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1
Upon a closer look, the docs do warn about coalesce:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)
Therefore, as suggested by @Amar, it's better to use repartition.
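A sketch of the same write using repartition; unlike coalesce(1), repartition(1) performs a full shuffle, so the upstream computation stays distributed and only the final output is collapsed into a single file:

// Full shuffle into one partition, producing a single output file per write.
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")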