How to read simple text file from Google Cloud Storage using Spark-Scala local Program

As described in the blog post below,

https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview

I am trying to read a file from Google Cloud Storage using Spark-Scala. For that I have added the Google Cloud Storage connector and the Google Cloud Storage client library as Gradle dependencies:

// https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage
compile group: 'com.google.cloud', name: 'google-cloud-storage', version: '0.7.0'

// https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector
compile group: 'com.google.cloud.bigdataoss', name: 'gcs-connector', version: '1.6.0-hadoop2'

After that, I created a simple Scala object (which creates a SparkSession) and tried to read the data like below:

val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
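For completeness, the surrounding object looks roughly like this (a minimal sketch; the SparkSession builder settings are assumptions for a local run, and only the read call is taken from my actual code):

import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; a plain SparkSession with a local master
    val spark = SparkSession.builder()
      .appName("gcs-read-test")
      .master("local[*]")
      .getOrCreate()

    // The read call shown above
    val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
    csvData.show()
  }
}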

But it throws the error below:

17/03/01 20:16:02 INFO GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
17/03/01 20:16:23 WARN HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
    at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:205)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
    at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:317)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
    at test$.main(test.scala:41)
    at test.main(test.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

I have set up all the required authentication as well, so I am not sure why the connection is timing out.

Edit

I am trying to run the above code through IntelliJ IDEA on Windows. A JAR built from the same code works fine on Google Cloud Dataproc, but it gives the above error when I run it on my local system. I have installed the Spark, Scala, and Google Cloud plugins in IntelliJ.

One more thing: I had created a Dataproc instance and tried to connect to its external IP address as described in the documentation, https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh, but that connection also failed with a timeout error.

Asked Mar 01 '17 by Shawn


People also ask

How do I read a file from a GCP bucket with Spark?

Cloud Storage connector: it is a JAR file, so download the connector, go to a shell, and find the Spark home directory. Copy the downloaded JAR file to the $SPARK_HOME/jars/ directory. Everything is then set for development; move to a Jupyter Notebook and write the code to access the files.

Can we use PySpark in GCP?

To do that, GCP provisions a cluster for each Notebook instance. We can execute PySpark and SparkR types of jobs from the notebook. Make sure you go through the usual configuration, such as Notebook Name, Region, Environment (Dataproc Hub), and Machine Configuration (we're using 2 vCPUs with 7.5 GB of RAM).


1 Answer

Thank you, Dennis, for pointing me toward the problem. Since I am using Windows, there is no core-site.xml, because Hadoop is not available for Windows.

I downloaded a pre-built Spark distribution, and in the code itself I configured the parameters you mentioned, as given below.

I created a SparkSession and used it to configure the Hadoop parameters, e.g. spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>"), along with all the other parameters that we would otherwise set up in core-site.xml.
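Put together, the setup looks roughly like this (a minimal sketch; the project ID and key-file path are placeholders, and the fs.gs.* keys are the standard gcs-connector property names):

import org.apache.spark.sql.SparkSession

object GcsReadLocal {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gcs-read-local")
      .master("local[*]")
      .getOrCreate()

    val conf = spark.sparkContext.hadoopConfiguration
    // Register the gs:// scheme with the connector's FileSystem implementations
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    // Authenticate with a service-account JSON key instead of the GCE metadata server
    conf.set("fs.gs.project.id", "<Project ID>") // placeholder
    conf.set("google.cloud.auth.service.account.enable", "true")
    conf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>") // placeholder

    val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
    csvData.show()
  }
}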

After setting all of these, the program could access files from Google Cloud Storage.

Answered Oct 18 '22 by Shawn