When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a single parameter (the path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).map(_.swap)
year.saveAsTextFile("year")
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
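For instance, both creation routes in one minimal sketch (the collection contents and the HDFS path here are illustrative, not from the question):

// 1. Parallelize an existing driver-side collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
// 2. Reference a dataset in external storage (path is a placeholder)
val fromFile = sc.textFile("hdfs:///data/numbers.txt")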
An RDD can be persisted through the cache() and persist() methods. cache() stores the RDD in memory (it is shorthand for persist() with the default MEMORY_ONLY storage level), so the RDD can be reused efficiently across parallel operations.
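A small sketch of caching in practice, reusing the question's input file; the repeated count() calls are just to show the cache being reused:

val lines = sc.textFile("apat63_99.txt")
lines.cache()  // equivalent to lines.persist(StorageLevel.MEMORY_ONLY)
lines.count()  // first action materializes the cache
lines.count()  // subsequent actions reuse the in-memory copy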
Saving text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD beneath it. The path is treated as a directory, and multiple output files are produced inside that directory, one per partition. This is how Spark is able to write output from multiple nodes in parallel.
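To see the one-file-per-partition behavior, here is a toy sketch (RDD contents and output path are illustrative; the part-file names follow the standard Hadoop output layout):

val words = sc.parallelize(Seq("a", "b", "c"), numSlices = 2)
words.saveAsTextFile("words")
// Creates a directory "words/" containing:
//   part-00000, part-00001, _SUCCESS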
Actions are RDD operations that return a value to the Spark driver program, kicking off a job that executes on the cluster. The output of transformations is the input of actions. reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey, and foreach are common actions in Apache Spark.
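A few of those actions on a toy RDD, as a sketch (values are illustrative):

val nums = sc.parallelize(Seq(1, 2, 3, 4))
val sum  = nums.reduce(_ + _)  // 10; aggregated on the cluster, returned to the driver
val all  = nums.collect()      // Array(1, 2, 3, 4), pulled back to the driver
val head = nums.first()        // 1
nums.foreach(println)          // runs on the executors, not the driver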
The reason it saves the output as multiple files is that the computation is distributed. If the output is small enough that you think it can fit on one machine, then you can end your program with
val arr = year.collect()
and then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy and make everything go to one partition, though that isn't advisable because you won't get any parallelization.
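For the collect() route, a sketch of the driver-side write, assuming java.io.PrintWriter (any local file API would do; "year.txt" is a hypothetical output name):

import java.io.PrintWriter

val arr = year.collect()  // pulls all results to the driver; only safe for small outputs
val out = new PrintWriter("year.txt")
try arr.foreach(t => out.println(t)) finally out.close()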
If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce down to one partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
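Applied to the question's RDD, both forms below produce a single part-file; the output paths are illustrative:

year.coalesce(1, shuffle = true).saveAsTextFile("year-single")
// equivalent, since repartition(1) is defined as coalesce(1, shuffle = true) in RDD.scala:
year.repartition(1).saveAsTextFile("year-single-alt")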