This is the result I get from my PySpark job in AWS Glue:
{a:1,b:7}
{a:1,b:9}
{a:1,b:3}
But I need to write this data to S3 and send it to an API in JSON array format:
[
{a:1,b:2},
{a:1,b:7},
{a:1,b:9},
{a:1,b:3}
]
I tried converting my output to a DataFrame and then applied toJSON():
results = mapped_dyF.toDF()
jsonResults = results.toJSON().collect()
But now I am unable to write the result back to S3 with 'write_dynamic_frame.from_options', as it requires a DynamicFrame, and my 'jsonResults' is now a plain list of JSON strings, not a DataFrame.
To put the data in JSON array format, I usually do the following, where df is the DataFrame containing the original data:
if df.count() > 0:
    # Build the list of records that will become the JSON array
    data = list()
    for row in df.collect():
        data.append({"a": row['a'],
                     "b": row['b']
                     })
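As a rough illustration of what that list turns into, here is a minimal sketch that uses plain dictionaries as hypothetical stand-ins for the collected rows (so no Spark session is needed) and serializes them with json.dumps. The sample values are taken from the question:

```python
import json

# Hypothetical stand-ins for the rows returned by df.collect()
rows = [{"a": 1, "b": 7}, {"a": 1, "b": 9}, {"a": 1, "b": 3}]

# Same loop as above: copy the fields of each row into a plain dict
data = []
for row in rows:
    data.append({"a": row["a"], "b": row["b"]})

# json.dumps turns the list of dicts into a single JSON array string
body = json.dumps(data)
print(body)
```

The printed string starts with "[" and ends with "]", which is the array format the question asks for.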
I haven't used Glue's write_dynamic_frame.from_options in this case; instead, I use boto3 to save the file:
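import json
import uuid

import boto3

s3 = boto3.resource('s3')
# Dump the JSON file to the S3 bucket
# bucket_name is the name of the target bucket and must be defined beforehand
# The key has no leading slash, so it is not placed under an empty-named prefix
filename = '{0}_batch.json'.format(uuid.uuid4())
obj = s3.Object(bucket_name, filename)
obj.put(Body=json.dumps(data))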
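If you already have the list of JSON strings from results.toJSON().collect(), as in the question, another option might be to join them into one array string instead of rebuilding the dictionaries. This is only a sketch: the hard-coded strings stand in for the collected result, and to_json_array is a hypothetical helper name, not part of Glue or PySpark:

```python
import json

# Stand-in for results.toJSON().collect(): a list of JSON object strings
json_results = ['{"a":1,"b":7}', '{"a":1,"b":9}', '{"a":1,"b":3}']


def to_json_array(json_strings):
    """Join JSON object strings into one JSON array string."""
    return "[" + ",".join(json_strings) + "]"


json_array = to_json_array(json_results)

# Parsing the string back checks that it is valid JSON
parsed = json.loads(json_array)
```

The resulting string could then be passed as Body to obj.put in the same way as json.dumps(data). Like collect() in the loop version, this pulls all rows to the driver, so it is only suitable for results that fit in memory.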