I have a large text file containing JSON objects on Amazon S3. I am planning to process this data using Spark on Amazon EMR.
Here are my questions:
This should cover #1, as long as you're using PySpark:
# Configure Spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")

# Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count()   # Count all rows
my_data.take(20)  # Take the first 20 rows

# Parse it
import json
my_data.map(lambda x: json.loads(x)).take(20)  # Take the first 20 rows of JSON-parsed content
Note the S3 address is s3n://, not s3://. This is a legacy thing from Hadoop.
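If your cluster ships the newer hadoop-aws s3a connector, the equivalent setup looks roughly like this; treat it as a sketch, since the property names and scheme support depend on your Hadoop version:
# Rough s3a equivalent (assumes the hadoop-aws s3a connector is on the classpath)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "MY-SECRET-ACCESS-KEY")
my_data = sc.textFile("s3a://my-bucket-name/my-key")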
Also, my-key can point to a whole S3 directory*. If you're using a Spark cluster, importing several medium-sized files is usually faster than importing a single big one.
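For example, both of these forms work with textFile (the prefix and glob below are just made-up names):
# Read every object under a "directory" prefix
my_data = sc.textFile("s3n://my-bucket-name/my-prefix/")
# Or use a glob pattern to pull in several files at once
my_data = sc.textFile("s3n://my-bucket-name/my-prefix/part-*")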
For #2 and #3, I'd suggest looking into Spark's Parquet support. You can also save text back to S3:
# Serialize records (assuming they're parsed dicts at this point) back to JSON strings and write to S3
my_data.map(lambda x: json.dumps(x)).saveAsTextFile('s3n://my-bucket-name/my-new-key')
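As a rough sketch of the Parquet route (this assumes Spark 2.x, where a SparkSession is already available as spark, and the parquet key name is made up):
# Let Spark parse the JSON lines straight into a DataFrame
df = spark.read.json("s3n://my-bucket-name/my-key")
# Store an intermediate, columnar copy as Parquet
df.write.parquet("s3n://my-bucket-name/my-parquet-key")
# Later stages can reload the Parquet copy instead of re-parsing the raw JSON
df = spark.read.parquet("s3n://my-bucket-name/my-parquet-key")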
Not knowing the size of your dataset and the computational complexity of your pipeline, I can't say which way of storing intermediate data to S3 will be the best use of your resources.
*S3 doesn't really have directories, but you know what I mean.