I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.
I have a Scala script that takes raw data from S3, processes it, and writes it to HDFS or even S3 with Spark-CSV. I think I can use multiple files as input if I want to use the AWS Machine Learning tool for training a prediction model. But if I want to use something else, I presume it is best if I get a single CSV output file.
Currently, since I do not want to use repartition(1) or coalesce(1) for performance reasons, I have used hadoop fs -getmerge for manual testing, but as it just merges the contents of the job's output files, I am running into a small problem: I need a single row of headers in the data file for training the prediction model.
If I use .option("header", "true") for spark-csv, it writes the header to every output file, and after merging I have as many header lines in the data as there were output files. But if the header option is false, it does not add any headers at all.
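For illustration, here is a minimal sketch of the kind of write that produces this behaviour, using Spark's built-in CSV source; the paths and session setup are placeholders, not from my actual script:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-header-demo").getOrCreate()

// Read the raw data; the S3 path is a placeholder.
val df = spark.read.option("header", "true").csv("s3://bucket/raw")

// With header = true, the column names are written at the top of EVERY
// part-XXXXX file (one per partition), so a later getmerge repeats them.
df.write.option("header", "true").csv("s3://bucket/processed")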
Now I found an option to merge the files inside the Scala script with the Hadoop API FileUtil.copyMerge. I tried this in spark-shell with the code below.
import org.apache.hadoop.fs.FileUtil
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val configuration = new Configuration()
val fs = FileSystem.get(configuration)
// merge all files under "smallheaders" into a single output file
FileUtil.copyMerge(fs, new Path("smallheaders"), fs, new Path("/home/hadoop/smallheaders2"), false, configuration, "")
But this solution still just concatenates the files on top of each other and does not handle headers. How can I get an output file with only one row of headers?
I even tried passing df.columns.mkString(",") as the last argument of copyMerge, but this still added the header multiple times, not once (presumably because that last argument is a string that copyMerge appends after every merged part file).
Spark SQL provides spark.read().csv("file_name") to read a file or a directory of CSV files into a DataFrame, and dataframe.write().csv("path") to write one out. To include the column names, use option("header", "true"); the CSV data source provides several such options. Note that if the DataFrame has, say, 3 partitions, Spark creates 3 part files when you save it to the file system.
The usual ways to end up with a single CSV file are: write a single file using Spark coalesce() or repartition(), write a single file using the Hadoop FileSystem library, merge the part files with HDFS getmerge, or write a single file in Databricks.
1. Write a single file using Spark coalesce() and repartition(): when you are ready to write a DataFrame, first use repartition(1) or coalesce(1) to merge the data from all partitions into a single partition, then save it. This still creates a directory, but it contains a single part file instead of multiple part files. A sketch of this approach follows.
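Here is a minimal, hedged sketch of that coalesce() approach; the output path is a placeholder and df stands for the processed DataFrame from the earlier sketch. It still produces a directory with one part file inside:

// Collapse to one partition so only one part file is written; this funnels
// all data through a single task, which is the performance concern above.
df.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv("s3://bucket/single-output")  // placeholder output path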
You can work around it like this. This way, none of the partitions has a header, except that a single partition's content starts with a row of column names coming from headerDF. When all partitions are merged together, there is a single header at the top of the file. Sample code follows.
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.{Row, SaveMode}
import scala.collection.JavaConverters._

// dataFrame is the data to save on disk
// cast all columns to String so the header row can be unioned with the data
val dataDF = dataFrame.select(dataFrame.columns.map(c => dataFrame.col(c).cast("string")): _*)

// create a new DataFrame containing only the header names as a single row
val headerDF = sparkSession.createDataFrame(List(Row.fromSeq(dataDF.columns.toSeq)).asJava, dataDF.schema)

// prepend the header row to the data and write without the CSV header option
headerDF.union(dataDF).write.mode(SaveMode.Overwrite).option("header", "false").csv(outputFolder)

// use Hadoop FileUtil to merge all partition csv files into a single file
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
FileUtil.copyMerge(fs, new Path(outputFolder), fs, new Path("/folder/target.csv"), true, sparkSession.sparkContext.hadoopConfiguration, null)
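As a quick sanity check (a hedged sketch, reusing the target path from above), you could read the merged file back and confirm the schema and row count:

// Read the merged file back; with exactly one header row, Spark should
// recover the original column names, and the count should match dataDF.
val merged = sparkSession.read.option("header", "true").csv("/folder/target.csv")
merged.printSchema()
println(merged.count())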