 

Persisting RDD on Amazon S3

I have a large text file containing JSON objects on Amazon S3. I am planning to process this data using Spark on Amazon EMR.

Here are my questions:

  1. How do I load the text file containing JSON objects into Spark?
  2. Is it possible to persist the internal RDD representation of this data on S3, after the EMR cluster is turned-off?
  3. If I am able to persist the RDD representation, is it possible to directly load the data in RDD format next time I need to analyze the same data?
asked by chandola

1 Answer

This should cover #1, as long as you're using PySpark:

# Configure Spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")

# Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count()   # Count all rows
my_data.take(20)  # Take the first 20 rows

# Parse it
import json
parsed_data = my_data.map(json.loads)
parsed_data.take(20)  # Take the first 20 rows of JSON-parsed content

Note that the S3 address uses s3n://, not s3://. This is a legacy of Hadoop's S3 filesystem support.

Also, my-key can point to a whole S3 directory*. If you're using a Spark cluster, importing several medium-sized files is usually faster than importing a single big one.
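For example, a minimal sketch (the bucket and prefix names here are hypothetical):

# Load every object under a prefix ("directory") at once
all_files = sc.textFile("s3n://my-bucket-name/my-prefix/")

# Or use a glob pattern to pick out specific files
json_files = sc.textFile("s3n://my-bucket-name/my-prefix/*.json")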

For #2 and #3, I'd suggest looking up Spark's Parquet support. You can also save text back to S3:

# Serialize each parsed record back to a JSON string, one object per line
parsed_data.map(json.dumps).saveAsTextFile('s3n://my-bucket-name/my-new-key')
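If you go the Parquet route for #2 and #3, a minimal sketch of writing and reloading the data might look like this (assuming Spark's SQLContext/DataFrame API is available; the bucket and key names are hypothetical):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Read the JSON objects straight into a DataFrame
df = sqlContext.read.json("s3n://my-bucket-name/my-key")

# Persist it as Parquet on S3; this survives the EMR cluster being shut down
df.write.parquet("s3n://my-bucket-name/my-data.parquet")

# Next time, load the Parquet files directly, with no re-parsing of the JSON
df = sqlContext.read.parquet("s3n://my-bucket-name/my-data.parquet")
df.rdd  # drop back down to an RDD of Rows if you need one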

Not knowing the size of your dataset and the computational complexity of your pipeline, I can't say which way of storing intermediate data to S3 will be the best use of your resources.

*S3 doesn't really have directories, but you know what I mean.

answered by Abe