saveAsTextFile method in spark

Tags:

apache-spark

In my project, I have three input files whose names are passed as args(0) to args(2), and an output file name passed as args(3). In the source code, I use

val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))

I do nothing to the log but save it as a text file by using

log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it still saves three files, part-00000, part-00001 and part-00002. Is there any way to save the three input files to a single output file?

asked Dec 31 '14 08:12

1 Answer

Having multiple output files is standard behavior on multi-machine clusters such as Hadoop or Spark. The number of output files depends on the number of reducers; in Spark, saveAsTextFile writes one part file per partition of the RDD being saved.
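To make the link between partitions and part files concrete, here is a minimal sketch, assuming Spark is on the classpath. The object name PartitionCount and the sample data are made up for illustration. It counts the partitions of a union of three single-partition RDDs, before and after coalesce(1, shuffle = true).

```scala
import org.apache.spark.SparkContext

object PartitionCount {
  // Returns (partitions after union, partitions after coalescing to one).
  // The three small RDDs stand in for the three input files of the question.
  def counts(sc: SparkContext): (Int, Int) = {
    val a = sc.parallelize(Seq("a1", "a2"), 1)
    val b = sc.parallelize(Seq("b1"), 1)
    val c = sc.parallelize(Seq("c1"), 1)
    // union keeps the partitions of each input side by side
    val merged = a.union(b).union(c)
    // coalesce with shuffle = true redistributes the data into one partition
    val single = merged.coalesce(1, shuffle = true)
    (merged.partitions.length, single.partitions.length)
  }
}
```

Since saveAsTextFile writes one part file per partition, the first number is the file count you would get from the plain union, and the second is the count after coalescing.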

How to "solve" it in Hadoop: merge output files after reduce phase

How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple file?

More information is also available here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
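The "merge after the job" idea can be sketched in plain Scala for an output directory on the local file system. The names MergeParts, outputDir and mergedFile are hypothetical and chosen for illustration. For output stored on HDFS, the `hadoop fs -getmerge` command serves a similar purpose.

```scala
import java.io.File
import java.nio.file.{Files, Path}

object MergeParts {
  // Concatenates the part-* files of a local output directory into one file,
  // in file-name order. Marker files such as _SUCCESS and hidden .crc files
  // are skipped because their names do not start with "part-".
  def merge(outputDir: Path, mergedFile: Path): Unit = {
    val parts = Option(outputDir.toFile.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)

    // newOutputStream creates the target file, or truncates it if it exists
    val out = Files.newOutputStream(mergedFile)
    try parts.foreach(p => Files.copy(p.toPath, out))
    finally out.close()
  }
}
```

This leaves the Spark job itself fully parallel and pays the single-writer cost only once, after the job has finished.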

So, you were right about coalesce(1, true). However, it is very inefficient, because all the data is moved into a single partition and written by a single task. Interestingly (as @climbage mentioned in his remark), your code works if you run it locally.
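For reference, a compact variant of the coalesce approach might look like the sketch below. It assumes, as in the question, that the last argument is the output directory and all earlier arguments are input files. The object name MergeInputs is made up for illustration, and the master is expected to come from spark-submit or the spark.master property.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MergeInputs {
  def main(args: Array[String]): Unit = {
    // Assumption: args = input1 input2 ... inputN outputDir
    val inputs = args.init
    val output = args.last

    val sc = new SparkContext(new SparkConf().setAppName("MergeInputs"))
    try {
      // textFile accepts a comma-separated list of paths,
      // so the explicit union loop is not needed
      val log = sc.textFile(inputs.mkString(","))
      // shuffle everything into one partition so that a single task writes the output
      log.coalesce(1, shuffle = true).saveAsTextFile(output)
    } finally sc.stop()
  }
}
```

The single partition is exactly what makes this slow on large inputs: one task receives and writes all of the data.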

What you might try is to read the files first and then save the output.

...
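val sc = new SparkContext()
// Collect the lines of every input file on the driver. Appending to a driver-side
// variable inside foreach does not work on a cluster, since the closure runs on executors.
val lines = (0 until args.size - 1).flatMap(i => sc.textFile(args(i)).collect().toSeq)
// and now you might save the content as a single partition
sc.parallelize(lines, 1).saveAsTextFile("out")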

Note: this code is also extremely inefficient and works for small files only, because all the data must fit in the driver's memory. You need to come up with better code. I wouldn't try to reduce the number of files; I would process the multiple output files instead.
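Processing the multiple output files is usually straightforward, because Spark can read a whole output directory back as one RDD. A minimal sketch, assuming outputDir was written earlier by saveAsTextFile (ReadParts and readAll are hypothetical names):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ReadParts {
  // Passing the directory itself to textFile reads every part file inside it;
  // marker files such as _SUCCESS are ignored by the underlying input format.
  def readAll(sc: SparkContext, outputDir: String): RDD[String] =
    sc.textFile(outputDir)
}
```

A downstream job can therefore treat the directory as its input and never needs to know how many part files it contains.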

answered Nov 18 '22 22:11

xhudik


Tags:

scala

apache-spark

kemiya
