I have a DataFrame and I want to save it as a single file to an HDFS location.
I found a solution here: Write single CSV file using spark-csv
df.coalesce(1)
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save("mydata.csv")
But all the data gets written to mydata.csv/part-00000, and I want it to be a mydata.csv file.
Is that possible?
Any help is appreciated.
Write a single file using Spark coalesce() or repartition(): when you are ready to write a DataFrame, first use repartition() or coalesce() to merge the data from all partitions into a single partition, so that Spark produces only one part file.
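As a quick illustration, here is a minimal sketch (it assumes Spark 2+, where the CSV writer is built in, and the output paths are placeholders):

# Both leave the data in a single partition before writing.
# coalesce(1) merges partitions without a full shuffle;
# repartition(1) forces a full shuffle first.
df.coalesce(1).write.option("header", "true").csv("hdfs:///tmp/out-coalesce")
df.repartition(1).write.option("header", "true").csv("hdfs:///tmp/out-repartition")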
Even then, Spark writes a directory containing that part file; getting a bare mydata.csv is not possible using the standard Spark library alone. But you can use the Hadoop FileSystem API to manage the filesystem yourself: save the output to a temporary directory, then move the file to the requested path. For example (in PySpark):
df.coalesce(1) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("mydata.csv-temp")
from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# Get a handle on the filesystem backing the Spark application (HDFS here)
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Find the single part file inside the temporary directory
file = fs.globStatus(spark._jvm.Path('mydata.csv-temp/part*'))[0].getPath().getName()

# Move it to the requested path, then remove the temporary directory
fs.rename(spark._jvm.Path('mydata.csv-temp/' + file), spark._jvm.Path('mydata.csv'))
fs.delete(spark._jvm.Path('mydata.csv-temp'), True)
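For reuse, the write-to-temp-then-rename dance can be wrapped in a small helper. This is a minimal sketch, not a standard API: the name save_single_csv is hypothetical, and it assumes Spark 2+ (built-in CSV writer) with the SparkSession passed in explicitly:

def save_single_csv(spark, df, path):
    """Write `df` as a single CSV file at `path` on HDFS.

    Writes to a temporary directory first, then renames the lone
    part file to `path` and deletes the temporary directory.
    """
    temp = path + "-temp"
    df.coalesce(1).write.option("header", "true").csv(temp)

    jvm = spark._jvm
    Path = jvm.org.apache.hadoop.fs.Path
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

    part = fs.globStatus(Path(temp + "/part*"))[0].getPath().getName()
    fs.rename(Path(temp + "/" + part), Path(path))
    fs.delete(Path(temp), True)

Usage: save_single_csv(spark, df, "mydata.csv")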