Add a new line to a text file in Spark

Tags:

scala

apache-spark

I have read a text file in Spark using the command

val data = sc.textFile("/path/to/my/file/part-0000[0-4]")

I would like to add a new line as a header of my file. Is there a way to do that without turning the RDD into an Array?

Thank you!


asked Apr 28 '15 08:04

amarchin

1 Answer

"Part" files are automatically handled as a file set.

val data = sc.textFile("/path/to/my/file") // Will read all parts.

Just add the header and write it out:

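// One slice keeps the single header line in one partition (no empty part files).
val header = sc.parallelize(Seq("...header..."), 1)
// Union: the header's partition comes before the partitions of data.
val withHeader = header ++ data
withHeader.saveAsTextFile("/path/to/my/modified-file")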

Note that because this has to read and rewrite all the data, it will be quite a bit slower than you might intuitively expect. (After all, you're just adding a single new line!) The output is again a directory of part files, and the header appears only once, ahead of the data. For these reasons and others, you may be better off not adding this header, and instead storing the metadata (list of columns) separately from the data.
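As a rough illustration of keeping the column list separate from the data, here is a minimal sketch in plain Scala. The object name `ColumnSidecar`, its method names, and the comma-separated layout are all assumptions made for this example, not part of Spark. It also assumes the metadata file is on a local filesystem; on HDFS you would write it through the Hadoop FileSystem API instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper: stores the column names in a small side file
// instead of prepending a header line to the data itself.
object ColumnSidecar {
  // Write the column names as a single comma-separated line.
  def writeColumns(file: Path, columns: Seq[String]): Unit =
    Files.write(file, columns.mkString(",").getBytes(StandardCharsets.UTF_8))

  // Read the column names back from the side file.
  def readColumns(file: Path): Seq[String] =
    new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim.split(",").toSeq

  // Pair one comma-separated data line with the column names.
  // The -1 limit keeps trailing empty fields.
  def toRecord(columns: Seq[String], line: String): Map[String, String] =
    columns.zip(line.split(",", -1)).toMap
}
```

With the columns loaded once on the driver, something like `data.map(line => ColumnSidecar.toRecord(columns, line))` could then attach names to each line without rewriting the part files.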


answered Nov 13 '22 17:11

Daniel Darabos
