Read CSV from S3 as a Spark DataFrame using PySpark (Spark 2.4)

I would like to read a CSV file from S3 (s3://test-bucket/testkey.csv) as a Spark DataFrame using PySpark. My cluster runs Spark 2.4.

I don't need to take infer_schema, credentials, or the like into account, and the CSV file should not be crawled as a Glue table.

Could you please share PySpark code that uses a SparkSession and reads the CSV into a Spark DataFrame?

Many thanks in advance and best regards

asked Jan 01 '23 by C.Tomas


1 Answer

You can set the relevant S3A properties as below:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("app_name") \
    .getOrCreate()

# Point the S3A connector at your credentials and the regional endpoint
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "mykey")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "mysecret")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Enable V4 request signing, required in some newer AWS regions
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
# Regional S3 endpoints take the form s3.<region>.amazonaws.com
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-west-3.amazonaws.com")
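As a side note (not part of the original answer), the same Hadoop properties can also be passed at session-build time by prefixing them with spark.hadoop., which avoids reaching into the private _jsc handle. A minimal sketch:

spark = SparkSession.builder \
    .appName("app_name") \
    .config("spark.hadoop.fs.s3a.access.key", "mykey") \
    .config("spark.hadoop.fs.s3a.secret.key", "mysecret") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-3.amazonaws.com") \
    .getOrCreate()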

After this, you can read the files:

csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
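For the exact file from the question, a minimal sketch; header and inferSchema are standard DataFrameReader options shown purely for illustration (the asker noted schema inference isn't required, so both can be dropped):

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("s3a://test-bucket/testkey.csv")

df.show(5)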
answered Jan 03 '23 by ravi malhotra