I'm trying to read a txt file from S3 with Spark, but I'm getting this error:
No FileSystem for scheme: s3
This is my code:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)
data = sc.textFile("s3://"+AWS_ACCESS_KEY+":" + AWS_SECRET_KEY + "@/aaa/aaa/aaa.txt")
header = data.first()
This is the full traceback:
An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
How can I fix this?
If you are running on a local machine, you can use boto3:
import boto3

s3 = boto3.resource('s3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket('yourBucket')
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key='yourFile.extension')
# get the object
response = obj.get()
# read the contents of the file (bytes), decode it, and split it into a list of lines
lines = response['Body'].read().decode('utf-8').split('\n')
(Do not forget to set up your AWS credentials.)
Another clean solution, if you are using an AWS virtual machine (EC2), is to grant S3 permissions to your EC2 instance and launch pyspark with this command:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2
If you add other packages, make sure the format is: 'groupId:artifactId:version' and the packages are separated by commas.
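After launching pyspark this way, reading the file is straightforward; a minimal sketch, assuming the instance role grants S3 access and using placeholder bucket/key names:
# Inside the pyspark shell, sc is already created for you.
# Use the s3a:// scheme registered by the hadoop-aws package; no credentials in the URL.
data = sc.textFile("s3a://yourBucket/aaa/aaa/aaa.txt")
header = data.first()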
If you are using pyspark from a Jupyter notebook, this will work:
import os
import pyspark

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath)  # Parquet file read example
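If the notebook cannot rely on an instance role, you can pass credentials through the Hadoop configuration instead of embedding them in the path; a minimal sketch, assuming AWS_ACCESS_KEY and AWS_SECRET_KEY hold your credentials and the bucket/file names are placeholders:
# Set the hadoop-aws (s3a) credential keys on the underlying Hadoop configuration.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY)
data = sc.textFile("s3a://yourBucket/yourFile.txt")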
If you're using a Jupyter notebook, you can instead add the two required JAR files to Spark's classpath by placing them in the jars directory of your pyspark installation, e.g.:
/home/ec2-user/anaconda3/envs/ENV-XXX/lib/python3.6/site-packages/pyspark/jars
The two files are the hadoop-aws JAR and the matching aws-java-sdk JAR mentioned above.
The above answers are correct regarding the need to specify the Hadoop <-> AWS dependencies.
The answers do not cover the newer versions of Spark, so I will post what worked for me, especially since this changed as of Spark 3.2.x, when Spark was upgraded to Hadoop 3.0.
For Spark versions before 3.2, pass:
--packages: org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2
--exclude-packages: com.google.guava:guava
For Spark 3.2.x and later, the cloud module is released together with Spark, so use the same version as your Spark:
--packages: org.apache.spark:spark-hadoop-cloud_2.12:3.2.0
According to the Spark documentation you should use the org.apache.spark:hadoop-cloud_2.12:<SPARK_VERSION> library. The problem is that this library does not exist in the central Maven repository, which is why the spark-hadoop-cloud artifact above is used instead.
How to pass these options: with spark-submit, use the --packages and --exclude-packages parameters; with SparkSession.builder, use the spark.jars.packages and spark.jars.excludes Spark configs:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2")
    .config("spark.jars.excludes", "com.google.guava:guava")
    .getOrCreate()
)
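With that session, reading from S3 is then a one-liner (bucket and key below are placeholders):
df = spark.read.text("s3a://yourBucket/aaa/aaa/aaa.txt")  # one string column named "value", one row per line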
s3 vs s3a
The above adds the S3AFileSystem to Spark's classpath. When you also set the following Spark config, not only s3a://... paths but also s3://... paths will work:
spark.hadoop.fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
Set it using SparkSession.builder.config() or via --conf when using spark-submit.
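For example, a minimal sketch of setting that config while building the session (same packages as above, everything else unchanged):
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2")
    .config("spark.jars.excludes", "com.google.guava:guava")
    # Map plain s3:// paths to the S3A filesystem implementation as well.
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)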