How to write a file using FileWriter on Google Dataproc?

I have a Java Spark application where the output of the Spark job needs to be collected and then saved to a CSV file. This is my code:

fileWriter = new FileWriter("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv", true);
fileWriter.append("col1,col2,col3,col4");

When I run the Spark job on Google Dataproc, I get a FileNotFoundException, even though I have read/write permissions on that folder.

java.io.FileNotFoundException: gs:/dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
at java.io.FileWriter.<init>(FileWriter.java:78)
at com.src.main.MyApp.testWriteOutput(MyApp.java:72)
at com.src.main.MyApp.main(MyApp.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It looks like FileWriter uses a single slash (/) at runtime instead of the double slash (//) after gs:. How can I solve this?

I am also open to ways other than FileWriter to write a file on Google Dataproc.

Vishnu P N asked May 24 '17 10:05


1 Answer

Dataproc installs a Hadoop FileSystem connector for GCS, which is accessible from Spark. In general, code running in Hadoop or Spark should build on top of that interface, which is not automatically compatible with the basic Java File interfaces. FileWriter treats its argument as a local file path rather than a URI, so the gs:// path never reaches the connector. You should do something like:

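import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Resolve the gs:// URI through the Hadoop FileSystem API (the GCS connector on Dataproc).
Path outputPath = new Path("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv");
// create(Path) overwrites an existing file by default.
OutputStream out = outputPath.getFileSystem(new Configuration()).create(outputPath);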

And then adapt it for whatever writer interfaces you need.
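
As a rough illustration of that adaptation, here is a minimal sketch that wraps any OutputStream in a BufferedWriter and writes CSV lines to it. The class name CsvStreamWriter and the helper writeCsv are hypothetical names made up for this example, and the second data row is sample data. The demo uses a ByteArrayOutputStream as a stand-in so it runs without Hadoop on the classpath; on Dataproc you would presumably pass the stream returned by create(outputPath) instead.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CsvStreamWriter {

    // Hypothetical helper: writes each row as one comma-separated line.
    // No quoting or escaping is done, so values must not contain commas or newlines.
    static void writeCsv(OutputStream out, List<String[]> rows) throws IOException {
        // try-with-resources flushes and closes the writer, which also closes `out`.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String[] row : rows) {
                writer.write(String.join(",", row));
                writer.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream returned by FileSystem.create(outputPath).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeCsv(out, Arrays.asList(
                new String[] {"col1", "col2", "col3", "col4"},
                new String[] {"a", "b", "c", "d"}));
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Closing the writer matters here: the data is only guaranteed to be flushed to the underlying stream once the writer is closed, so the helper closes it itself rather than leaving that to the caller.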

Dennis Huo answered Sep 22 '22 10:09