
How to use S3 with Apache Spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I have replaced the actual values of access-key and secret-key):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
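(For reference, the same settings can also be applied at runtime from inside the shell instead of conf/spark-defaults.conf; a minimal sketch, using the standard Hadoop configuration API with the same placeholder keys:)

// In spark-shell: equivalent of the spark.hadoop.* entries above,
// set on the live SparkContext's Hadoop configuration.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.set("fs.s3a.access.key", "access-key")   // placeholder value
hc.set("fs.s3a.secret.key", "secret-key")   // placeholder value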

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar 

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person") 

And here is the error that results:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1 

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null
    unknown resolver null
    unknown resolver null
    unknown resolver null
    unknown resolver null
    unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

And here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")

java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

asked Aug 18 '17 by Shafique Jamal



1 Answer

If you are using Apache Spark 2.2.0 (the standard binary distribution, which is built against Hadoop 2.7), then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar instead: the hadoop-aws version has to match the Hadoop version Spark was built with, and hadoop-aws 2.7.x was in turn compiled against aws-java-sdk 1.7.4. Mixing the Hadoop 2.8 jars with Spark's bundled Hadoop 2.7 classes is exactly what produces the NoClassDefFoundError and IllegalAccessError above.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar 
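Equivalently (a variant of the same fix, not something tested here), passing the matching hadoop-aws version to --packages lets Ivy resolve the correct aws-java-sdk as a transitive dependency, so you don't have to download the jars by hand:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3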

After that, loading data from the S3 bucket in the shell should work.
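As a quick sanity check (a sketch against the same public bucket the question uses; the record contents are simply whatever is stored there):

val p = spark.read.textFile("s3a://sparkcookbook/person")  // Dataset[String]
p.count()   // should complete without the class-loading errors
p.show(5)   // print the first few records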

answered Sep 27 '22 by himanshuIIITian