Save a spark RDD to the local file system using Java

I have an RDD that is generated using Spark. If I write this RDD out to a CSV file, the provided methods such as saveAsTextFile() output the file to HDFS.

I want to write the file to my local file system so that my SSIS process can pick up the files from there and load them into the DB.

I am currently unable to use Sqoop.

Is there a way to do this in Java, other than writing shell scripts?

If any clarification is needed, please let me know.

asked Jul 06 '15 by Kanav Sharma

People also ask

How do I save an RDD to a file?

You can save an RDD using the saveAsObjectFile and saveAsTextFile methods, and you can read it back using the textFile and sequenceFile functions on SparkContext.
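For illustration, here is a minimal Java sketch of those calls, assuming a local-mode SparkContext; the /tmp output paths and the tiny in-memory dataset are made-up examples, not from the original question.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SaveRddExample {
    public static void main(String[] args) {
        // Assumed local setup for the sketch; on a cluster the master URL would differ
        SparkConf conf = new SparkConf().setAppName("SaveRddExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a,1", "b,2", "c,3"));

        // Write the RDD as plain text: one part-* file per partition under the directory
        lines.saveAsTextFile("/tmp/rdd-as-text");

        // Write the RDD as serialized Java objects
        lines.saveAsObjectFile("/tmp/rdd-as-objects");

        // Read the text output back into an RDD
        JavaRDD<String> reread = sc.textFile("/tmp/rdd-as-text");
        System.out.println(reread.count());

        sc.stop();
    }
}
```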

Can Spark write to the local file system?

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. Text file RDDs can be created using SparkContext's textFile method.
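A small sketch of that, with made-up paths: the same textFile call accepts local file:// URIs as well as HDFS (or S3) URIs, and sequenceFile reads a Hadoop SequenceFile given its key and value Writable types.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadSourcesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReadSourcesExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Local file system (on a real cluster the file must be reachable from every worker)
        JavaRDD<String> local = sc.textFile("file:///tmp/input.txt");

        // HDFS path; S3 (s3a://...) and other Hadoop-supported stores work the same way
        JavaRDD<String> onHdfs = sc.textFile("hdfs:///data/input.txt");

        // Hadoop SequenceFile, specifying the key and value Writable classes
        JavaPairRDD<Text, IntWritable> seq =
                sc.sequenceFile("hdfs:///data/input.seq", Text.class, IntWritable.class);

        System.out.println(local.count() + " " + onHdfs.count() + " " + seq.count());

        sc.stop();
    }
}
```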

How are RDDs stored?

RDDs store data in memory for fast access during computation and provide fault tolerance [110]. An RDD is an immutable distributed collection of key–value pairs of data, stored across the nodes of a cluster, and it can be operated on in parallel.
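As a rough sketch of what "stored in memory" means in practice, the snippet below persists a small key-value RDD so that repeated actions reuse the cached partitions; the dataset and class names are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

import java.util.Arrays;

public class CacheRddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CacheRddExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An immutable distributed collection of key-value pairs
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)));

        // Keep the partitions in memory so repeated actions do not recompute them
        pairs.persist(StorageLevel.MEMORY_ONLY());

        System.out.println(pairs.count());    // first action materializes and caches the RDD
        System.out.println(pairs.lookup("a")); // later actions reuse the cached partitions

        sc.stop();
    }
}
```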


1 Answer

saveAsTextFile is able to take in local file system paths (e.g. file:///tmp/magic/...). However, if you're running on a distributed cluster, you most likely want to collect() the data back to the driver and then save it with standard file operations.
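A hedged Java sketch of both options, with assumed paths and a local-mode SparkConf for illustration. A file:/// URI makes saveAsTextFile write part files to the local disk of whichever node runs each task, while collect() brings the rows back to the driver so they can be written as a single CSV with plain Java I/O (only safe when the data fits in driver memory), which is usually what a downstream process like SSIS wants.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SaveRddLocally {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("SaveRddLocally").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> csvLines = sc.parallelize(Arrays.asList("1,foo", "2,bar"));

        // Option 1: a file:/// path makes saveAsTextFile write to the local file system
        // (each executor writes its part files to its own local disk).
        csvLines.saveAsTextFile("file:///tmp/output-dir");

        // Option 2: collect() pulls the rows back to the driver, where they can be written
        // as a single file with ordinary Java I/O.
        List<String> rows = csvLines.collect();
        Files.write(Paths.get("/tmp/output.csv"), rows);

        sc.stop();
    }
}
```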

answered Oct 12 '22 by Holden