I have a Java Spark application where the output from the Spark job needs to be collected and then saved to a CSV file. This is my code:
fileWriter = new FileWriter("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv", true);
fileWriter.append("col1,col2,col3,col4");
When I execute the Spark job on Google Cloud Dataproc, I get a FileNotFoundException, even though I have read/write permissions on that folder.
java.io.FileNotFoundException: gs:/dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
at java.io.FileWriter.<init>(FileWriter.java:78)
at com.src.main.MyApp.testWriteOutput(MyApp.java:72)
at com.src.main.MyApp.main(MyApp.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
It looks like FileWriter uses a single slash (/) instead of the double slash (//) after gs: at runtime. How can I solve this?
I am also open to alternatives to FileWriter for writing the file from Google Cloud Dataproc.
Dataproc installs a Hadoop FileSystem connector for GCS that is accessible from Spark. In general, code running on Hadoop or Spark should build on that interface, which is not automatically compatible with the basic Java file APIs: java.io treats the string as a local file path and normalizes it, which is why gs:// shows up as gs:/ in the exception. You should do something like:
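import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Resolve the gs:// scheme through the Hadoop FileSystem API instead of java.io.
Path outputPath = new Path("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv");
// create() overwrites an existing file at this path by default.
OutputStream out = outputPath.getFileSystem(new Configuration()).create(outputPath);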
And then adapt it for whatever writer interfaces you need.
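As a rough sketch of that adaptation, the stream can be wrapped in a standard writer. The class name CsvStreamWriter and the method writeCsv below are hypothetical names made up for illustration. The method only assumes it receives some OutputStream, such as the `out` obtained from the Hadoop FileSystem above, and it assumes the rows are already comma-joined strings that need no CSV quoting or escaping.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Spark.
final class CsvStreamWriter {

    // Writes the header line, then one line per row, to the given stream.
    // Assumption: each row is already a comma-joined string with no escaping needed.
    static void writeCsv(OutputStream out, String header, List<String> rows) throws IOException {
        // try-with-resources flushes and closes the writer, which also closes 'out'.
        try (BufferedWriter writer =
                 new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.write(header);
            writer.write('\n');
            for (String row : rows) {
                writer.write(row);
                writer.write('\n');
            }
        }
    }
}
```

With the snippet above, the call would look like CsvStreamWriter.writeCsv(out, "col1,col2,col3,col4", rows), where rows is the list collected from the Spark job. Closing the writer matters here, since the output may not be fully written to the destination until the stream is closed.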