By default, newer versions of Spark use compression when saving text files. For example:
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("/path/to/output")
will create files in .deflate format. It is easy to change the compression algorithm, e.g. to gzip:
import org.apache.hadoop.io.compress._
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("/path/to/output", classOf[GzipCodec])
But is there a way to save an RDD as plain text files, i.e. without any compression?
To read a CSV or TSV file from HDFS, call read.csv("path") with an HDFS path. To write a CSV file to HDFS, use the write() method of the Spark DataFrameWriter object to write the DataFrame out as CSV.
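A minimal sketch of that round trip, assuming Spark 2.x or later (where read.csv and the DataFrameWriter API are available) and a reachable HDFS namenode; the hostname, port, and paths below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvOnHdfs").master("local").getOrCreate()

// Read a CSV file from HDFS into a DataFrame.
val df = spark.read.option("header", "true").csv("hdfs://namenode:8020/data/input.csv")

// Write the DataFrame back to HDFS as CSV via the DataFrameWriter.
df.write.option("header", "true").csv("hdfs://namenode:8020/data/output")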
You can save an RDD using the saveAsObjectFile and saveAsTextFile methods, and read it back using the textFile and sequenceFile functions on SparkContext.
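A short sketch of both round trips, assuming an existing SparkContext sc; the /tmp paths are placeholders. Note that output written with saveAsObjectFile is most directly read back with objectFile, which deserializes the SequenceFile it produces:

// Save as plain text; textFile reads the elements back as Strings.
val nums = sc.parallelize(1 to 5)
nums.saveAsTextFile("/tmp/nums-text")
val asText = sc.textFile("/tmp/nums-text")

// Save as serialized objects; objectFile reads the underlying
// SequenceFile back into typed elements.
nums.saveAsObjectFile("/tmp/nums-objects")
val asInts = sc.objectFile[Int]("/tmp/nums-objects")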
Saving text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD to files under that path. The path is treated as a directory, and multiple output files (one per partition) are produced in that directory; this is how Spark writes output from multiple nodes in parallel, as the sketch below shows.
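For example, an RDD with three partitions yields three part files under the output directory (a sketch assuming an existing SparkContext sc; the path is a placeholder):

// Three partitions produce three part files.
val words = sc.parallelize(List("Hello", "world", "!"), 3)
words.saveAsTextFile("/tmp/words-out")
// /tmp/words-out now contains part-00000, part-00001, part-00002 and a _SUCCESS marker.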
With the code below, I can see the text files in HDFS without any compression.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("App name")
val sc = new SparkContext(conf)
// Disable output compression for Hadoop-backed output formats.
sc.hadoopConfiguration.set("mapred.output.compress", "false")
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/path/to/save/file")
You can set any Hadoop-related property on sc.hadoopConfiguration.
Verified this code with Spark 1.5.2 (Scala 2.11).
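Note that mapred.output.compress is the legacy property name; newer Hadoop releases use the mapreduce.output.fileoutputformat.compress key instead, so setting both is a defensive sketch rather than a verified requirement:

// Cover both the legacy and the current Hadoop property names.
sc.hadoopConfiguration.set("mapred.output.compress", "false")
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "false")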