AWS Glue job consuming data from external REST API

Tags: aws-glue, aws-glue-data-catalog

I'm trying to create a workflow in which an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or another AWS-internal source. Is that even possible? Has anyone done it? Please help!

525

asked Jan 13 '20 09:01

dstdnk

1 Answer

Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they start faster (a relatively small cold start). When the extraction finishes, it triggers a Spark job that reads only the JSON items I need. I use the Python requests library.
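As a rough illustration of the extraction step, here is a minimal sketch of fetching JSON with requests inside a Python Shell job. The function name `fetch_json`, the `get` parameter (there only so the function can be run with a stub), the 30-second timeout, and the example URL are my own assumptions, not part of any particular API:

```python
def fetch_json(url, params=None, get=None, timeout=30):
    """Fetch one JSON document from a REST endpoint and return the parsed body."""
    if get is None:
        # Imported lazily so the function can also be exercised with a stub.
        import requests
        get = requests.get
    response = get(url, params=params, timeout=timeout)
    # Fail the job on 4xx/5xx instead of silently saving an error payload.
    response.raise_for_status()
    return response.json()
```

In the job you would call something like `items = fetch_json("https://api.example.com/items", params={"q": "glue"})`. Real APIs usually also need authentication headers and pagination, which depend on the service and are left out here.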

To save the data to S3, you can do something like this:

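import json

import boto3

# Create an S3 resource handle
s3 = boto3.resource('s3')

tweets = []
# Code that extracts tweets from the API and appends them to `tweets` goes here
tweets_json = json.dumps(tweets)

# Write the serialized JSON to s3://my-tweets/tweets.json
obj = s3.Object("my-tweets", "tweets.json")
obj.put(Body=tweets_json)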
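Once the file is in S3, the extraction job can start the Spark job mentioned above. One possible way, sketched here, is to call `start_job_run` on the boto3 Glue client. The helper name `start_spark_job`, the `--input_path` argument, and the job name in the usage line are hypothetical; a Glue trigger or workflow could do the same thing without any code:

```python
def start_spark_job(job_name, s3_path, glue=None):
    """Start a Glue job run and return its run ID."""
    if glue is None:
        # Imported lazily so the function can also be exercised with a stub client.
        import boto3
        glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        # Glue job arguments are passed as "--name": "value" pairs.
        # "--input_path" is a made-up argument that the Spark job would have to read.
        Arguments={"--input_path": s3_path},
    )
    return response["JobRunId"]
```

For example, `start_spark_job("tweets-transform", "s3://my-tweets/tweets.json")` would start a job named `tweets-transform`, assuming such a job exists and the role running the Python Shell job is allowed to start it.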

146

answered Sep 20 '22 16:09

Aida Martinez
