 

Spark read file from S3 using sc.textFile("s3n://...")

Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...

The IOException: No FileSystem for scheme: s3n error occurred with:

  • Spark 1.3.1 or 1.4.0 on a dev machine (no Hadoop libs)
  • Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
  • Using s3:// or s3n:// scheme

What is the cause of this error? A missing dependency, missing configuration, or misuse of sc.textFile()?

Or maybe this is due to a bug that affects the Spark build specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try the Spark build for Hadoop 2.4.0 to see if this solves the issue.

asked Jun 15 '15 by Polymerase


People also ask

Can Spark Streaming read from S3?

You need to provide the path to the S3 bucket, and it will stream data from all the files in that bucket. Then, whenever a new file is created in the bucket, it will be streamed. If you append data to an existing file that has already been read, those new updates will not be read.
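As a rough sketch (my own, with placeholder bucket and path names, not from the original page), such a streaming read would point StreamingContext.textFileStream at the S3 prefix:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder app name and a 30-second batch interval; AWS credentials are
// assumed to be configured elsewhere (e.g. in the Hadoop configuration).
val conf = new SparkConf().setAppName("S3StreamExample")
val ssc = new StreamingContext(conf, Seconds(30))

// textFileStream picks up files created under this prefix after the stream
// starts; appends to files that were already read are not picked up.
val lines = ssc.textFileStream("s3n://my-bucket/logs/")
lines.count().print()

ssc.start()
ssc.awaitTermination()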

What is sc.textFile?

textFile is a method of the org.apache.spark.SparkContext class that reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
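For illustration (my own sketch, with placeholder paths), the same method works across the supported schemes and always returns an RDD[String] with one element per line:

// Placeholder paths; each call returns an RDD[String], one element per line.
val localRdd = sc.textFile("file:///tmp/myFile.log")     // local file system
val hdfsRdd  = sc.textFile("hdfs:///data/myFile.log")    // HDFS
val s3Rdd    = sc.textFile("s3n://myBucket/myFile1.log") // S3, given a working s3n connector
println(localRdd.count())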


2 Answers

Confirmed that this is related to the Spark build against Hadoop 2.6.0. I just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of the Hadoop 2.6 build), and the code now works OK.
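(A side note of mine, not part of the original answer: if you are unsure which Hadoop version your Spark download was built against, you can check it from the spark-shell via Hadoop's VersionInfo.)

scala> org.apache.hadoop.util.VersionInfo.getVersion   // e.g. "2.4.0" for the "Pre-built for Hadoop 2.4" download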

sc.textFile("s3n://bucketname/Filename") now raises another error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). 

The code below uses the S3 URL format with inline credentials to show that Spark can read an S3 file, running on a dev machine (no Hadoop libs).

scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

scala> lyrics.count
res1: Long = 9

Even better: the code above with AWS credentials inline in the s3n URI will break if the AWS Secret Key contains a forward slash "/". Configuring the AWS credentials in the SparkContext fixes that, and the code works whether the S3 file is public or private.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
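Alternatively (my own sketch, not part of the original answer), in a standalone application the same two properties can be supplied through SparkConf before the context is created, since settings prefixed with spark.hadoop. are copied into the underlying Hadoop configuration:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder credentials; the "spark.hadoop." prefix forwards these keys
// into the Hadoop configuration used by the s3n:// filesystem.
val conf = new SparkConf()
  .setAppName("S3ReadExample")
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", "BLABLA")
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "....") // can contain "/"

val sc = new SparkContext(conf)
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
println(myRDD.count())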
answered Sep 22 '22 by Polymerase


Although this question already has an accepted answer, I think the exact details of why this is happening are still missing, so there may be room for one more answer.

If you add the required hadoop-aws dependency, your code should work.

Starting with Hadoop 2.6.0, the S3 FS connector has been moved to a separate library called hadoop-aws. There is also a JIRA for that: Move s3-related FS connector code to hadoop-aws.

This means that any version of Spark that has been built against Hadoop 2.6.0 or newer has to pull in an additional external dependency to be able to connect to the S3 file system.
Here is an sbt example that I have tried and that works as expected with Apache Spark 1.6.2 built against Hadoop 2.6.0:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

In my case, I ran into some dependency issues, which I resolved by adding exclusions:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")

On a related note, I have yet to try it, but it is recommended to use the "s3a" and not the "s3n" filesystem starting with Hadoop 2.6.0.

The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
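I have not tested this myself, but based on the property names exposed by the hadoop-aws module, an s3a read would look roughly like the following sketch (bucket, path, and credentials are placeholders):

// Untested sketch: requires the hadoop-aws module on the classpath.
sc.hadoopConfiguration.set("fs.s3a.access.key", "MyAccessKeyID")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "MySecretKey")
val rdd = sc.textFile("s3a://myBucket/myFile1.log")
println(rdd.count())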

answered Sep 21 '22 by Sergey Bahchissaraitsev