 

saveAsTextFile method in Spark

In my project, I have three input files whose names I pass as args(0) to args(2), and an output file name passed as args(3). In the source code, I use

val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))

I do nothing with the log except save it as a text file using

log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it is still saved as three files: part-00000, part-00001, part-00002. Is there any way to save the three input files into a single output file?

asked Dec 31 '14 by kemiya

People also ask

How do I write to HDFS in spark?

To read a CSV or TSV file from HDFS, use spark.read.csv("path") with an HDFS path. To write, use the write() method of the Spark DataFrameWriter to save a DataFrame as a CSV file on HDFS.
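
For example, a minimal sketch of that pattern, assuming Spark 2.x or later with a SparkSession; the HDFS paths, app name, and header option below are placeholders, not from the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-hdfs-example").getOrCreate()

// read a CSV file from HDFS (placeholder path)
val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")

// write the DataFrame back to HDFS as CSV via the DataFrameWriter
df.write.option("header", "true").csv("hdfs:///data/output")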

Can we store data in spark?

Spark can be used for processing datasets larger than the aggregate memory of a cluster. Spark will attempt to store as much data in memory as possible and then spill to disk. It can keep part of a dataset in memory and the rest on disk.
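
As a small illustrative sketch (the path and app name are placeholders), you can make that spill behavior explicit by persisting an RDD with a memory-and-disk storage level:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-example"))
val data = sc.textFile("hdfs:///data/big-input")

// keep partitions in memory where they fit, spill the rest to disk
data.persist(StorageLevel.MEMORY_AND_DISK)
println(data.count())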


1 Answer

Having multiple output files is standard behavior for distributed frameworks like Hadoop and Spark. The number of output files corresponds to the number of reducers (in Hadoop) or output partitions (in Spark); a sketch of that relationship follows below.
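
As a quick sketch (assuming an existing RDD named log, as in the question; the output paths are placeholders), each partition written becomes one part-XXXXX file, so changing the partition count before saving changes the file count:

// three partitions -> part-00000, part-00001, part-00002
log.repartition(3).saveAsTextFile("out-three")

// one partition -> a single part-00000
log.repartition(1).saveAsTextFile("out-one")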

How to "solve" it in Hadoop: merge output files after reduce phase

How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple file?

You can also find good information here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html

So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly, as @climbage mentioned in his comment, your code works if you run it locally.

What you might try is to read the files first and then save the output.

...
val sc = new SparkContext()
var str = ""
for (i <- 0 until args.size - 1) {
  // collect each input file to the driver and append its lines
  val file = sc.textFile(args(i))
  str += file.collect().mkString("\n") + "\n"
}
// and now you might save the content as a single-partition RDD
sc.parallelize(Seq(str), 1).saveAsTextFile("out")

Note: this code is also extremely inefficient and works for small files only! You need to come up with something better. I wouldn't try to reduce the number of files; I would process the multiple output files instead.
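
Not part of the original answer, but one common alternative sketch: sc.textFile accepts a comma-separated list of paths, so you can read all inputs in one call and only coalesce at the very end (the expressions below mirror the question's args layout):

// read every input path in a single call (comma-separated list of paths)
val merged = sc.textFile(args.dropRight(1).mkString(","))

// coalesce to one partition only if the data fits on a single executor,
// then write a single part-00000 under the output path
merged.coalesce(1, shuffle = true).saveAsTextFile(args.last)

Another option is to leave the output split across part files and merge it afterwards, for example with hdfs dfs -getmerge.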

answered Nov 18 '22 by xhudik