 

Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.
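
For example, a minimal sketch of what works for me today (the path is a placeholder):

// Writing an RDD as text already works; local path on my machine, HDFS path on the cluster
val rdd = sparkContext.parallelize(Seq("line one", "line two"))
rdd.saveAsTextFile("out/result")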

I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.

SparkContext seems to have a few file-related methods, but they all appear to handle inputs, not outputs.

How do I do this?

asked Oct 05 '15 by Joe

People also ask

Can Spark write to local file system?

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. Text file RDDs can be created using SparkContext's textFile method.
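
For example, a minimal sketch assuming an existing SparkContext named sc (the paths are placeholders):

// Read text files as RDD[String] from either file system
val localLines = sc.textFile("file:///tmp/input.txt")    // local file system
val hdfsLines = sc.textFile("hdfs:///user/me/input.txt") // HDFS
println(localLines.count())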

How do I write to HDFS in Spark?

Write & Read CSV & TSV file from HDFS read. csv("path") , replace the path to HDFS. And Write a CSV file to HDFS using below syntax. Use the write() method of the Spark DataFrameWriter object to write Spark DataFrame to a CSV file.

How do I create a text file in Spark?

text("file_name") to read a file or directory of text files into a Spark DataFrame, and dataframe. write(). text("path") to write to a text file. When reading a text file, each line becomes each row that has string “value” column by default.

Can Spark connect to HDFS?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.


1 Answer

Thanks to marios and kostya, but there are a few steps to writing a text file into HDFS from Spark:

import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop configuration is accessible from the SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration)

// An output file can be created from the file system
val output = fs.create(new Path(filename))

// Wrap the stream in a BufferedOutputStream to write an actual text file
val os = new BufferedOutputStream(output)

os.write("Hello World".getBytes("UTF-8"))

os.close()

Note that FSDataOutputStream, which has been suggested, extends Java's DataOutputStream for writing binary primitives, not a text output stream. Its writeUTF method appears to write plain text, but it actually writes a modified UTF-8 format with a leading two-byte length prefix, so the file ends up with extra bytes.
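
If you prefer writer semantics over raw bytes, a sketch (assuming the same fs as above and a placeholder path) wraps the HDFS stream in a java.io.PrintWriter:

import java.io.{OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

val out = fs.create(new Path("/user/me/hello.txt")) // placeholder path
val writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
try {
  writer.println("Hello World") // plain text, no length prefix
} finally {
  writer.close() // also closes the underlying HDFS stream
}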

answered Sep 28 '22 by Joe