
Spark: saveAsTextFile without compression

By default, newer versions of Spark use compression when saving text files. For example:

val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("/path/to/output")

will create files in .deflate format. It's quite easy to change the compression codec, e.g. to gzip:

import org.apache.hadoop.io.compress._
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("/path/to/output", classOf[GzipCodec])

But is there a way to save an RDD as plain text files, i.e. without any compression?

asked Oct 26 '16 by ffriend

1 Answer

With this code, I can see the text files in HDFS without any compression:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("App name")
val sc = new SparkContext(conf)
// Disable output compression before calling saveAsTextFile
sc.hadoopConfiguration.set("mapred.output.compress", "false")
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/path/to/save/file")

You can set any Hadoop-related property on sc.hadoopConfiguration this way.
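As a side note, newer Hadoop releases deprecated the mapred.* property names in favor of mapreduce.* ones. A minimal sketch, assuming your Hadoop version honors at least one of the two key names (setting both is a defensive choice, not something verified against every release):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("App name")
val sc = new SparkContext(conf)
// Old (Hadoop 1.x / mapred API) property name
sc.hadoopConfiguration.set("mapred.output.compress", "false")
// Newer (Hadoop 2.x+ / mapreduce API) property name
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "false")
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/path/to/save/file")
```

Whichever name your cluster recognizes, the output part files should then be written as plain text rather than .deflate.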

Verified this code on Spark 1.5.2 (Scala 2.11).

answered Sep 19 '22 by mrsrinivas