
Trying to read and write Parquet files from S3 with local Spark

I'm trying to read and write Parquet files between my local machine and S3 using Spark, but I can't seem to configure my Spark session properly to do so. There are clearly configurations to be set, but I could not find a clear reference on how to do it.

Currently my Spark session reads local Parquet mocks and is defined as follows:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()
asked Jan 29 '23 by dlaredod

1 Answer

I'm going to have to correct the post by himanshuIIITian slightly (sorry).

  1. Use the s3a connector, not the older, obsolete, unmaintained s3n. S3A is faster, works with the newer S3 regions (Seoul, Frankfurt, London, ...), and scales better. S3N has fundamental performance issues which were only fixed in the latest versions of Hadoop by deleting that connector entirely. Move on. (See the configuration sketch after this list.)

  2. You cannot safely use S3 as a direct destination of a Spark query, not with the classic "FileSystem" committers available today. Write to your local file:// filesystem and then copy the data up afterwards with the AWS CLI. You'll get better performance as well as the guarantees of reliable writing which you would normally expect from IO.

answered Feb 16 '23 by stevel