
Add a new line to a text file in Spark

I have read a text file in Spark using the command

val data = sc.textFile("/path/to/my/file/part-0000[0-4]")

I would like to add a new line as a header of my file. Is there a way to do that without turning the RDD into an Array?

Thank you!

amarchin asked Apr 28 '15 08:04




1 Answer

"Part" files are automatically handled as a file set.

val data = sc.textFile("/path/to/my/file") // Will read all parts.

Just add the header and write it out:

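// A single partition for the header, so it does not add empty part files.
val header = sc.parallelize(Seq("...header..."), 1)
// Union keeps partition order: the header's partition comes before the data's.
val withHeader = header ++ data
withHeader.saveAsTextFile("/path/to/my/modified-file")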
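
As an alternative sketch (not part of the original answer), the header can be emitted in front of the first partition with mapPartitionsWithIndex, which avoids building a separate header RDD. The helper name prependHeader is hypothetical, and the local master is only an assumption for running outside spark-shell; inside the shell the existing context is reused.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context if there is one; "local[*]" is an assumption for standalone runs.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("add-header").setMaster("local[*]"))

// Hypothetical helper: put the header before the lines of partition 0 only.
def prependHeader(header: String, index: Int, lines: Iterator[String]): Iterator[String] =
  if (index == 0) Iterator(header) ++ lines else lines

val data = sc.textFile("/path/to/my/file")
val withHeader = data.mapPartitionsWithIndex(
  (index, lines) => prependHeader("...header...", index, lines))
withHeader.saveAsTextFile("/path/to/my/modified-file")
```

This still reads and rewrites every line, so it is no faster than the union; it only changes how the header gets in front of the data.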

Note that because this has to read and write all the data, it will be quite a bit slower than you might expect for adding a single line. The output is also a directory of part files, and the header ends up only in the first one. For these reasons and others, you may be better off not adding the header at all, and instead storing the metadata (the list of columns) separately from the data.
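
A minimal sketch of the "store it separately" idea, under stated assumptions: the column names, the comma delimiter, and the header path below are all hypothetical and do not come from the question. The data files are left untouched.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context if there is one; "local[*]" is an assumption for standalone runs.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("separate-header").setMaster("local[*]"))

// Hypothetical column list and output path, for illustration only.
val columns = Seq("id", "name", "value")
val headerPath = "/path/to/my/file-header"

// Write the column names once, as a single-partition RDD so only one part file is produced.
sc.parallelize(Seq(columns.mkString(",")), 1).saveAsTextFile(headerPath)

// Later: read the column names back without reading or rewriting the data files.
val restored = sc.textFile(headerPath).first().split(",").toSeq
```

Because only the one-line header is written, the cost no longer depends on the size of the data.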

Daniel Darabos answered Nov 13 '22 17:11