 

AWS Glue - how to crawl a Kinesis Firehose output folder from S3


I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.

I have a Kinesis Firehose job dumping streaming data into an S3 bucket. These files consist of a series of discrete web browsing events represented as JSON documents with varying structures (so, say, one document might have field 'date' but not field 'name', whereas another might have 'name' but not 'date').

I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.

I'm trying to run an S3 Data Catalog crawler, and the problem I'm running into is that the Kinesis output format is not, itself, valid JSON, which is just baffling to me. Instead, it's a bunch of JSON documents separated by line breaks. The crawler can automatically identify and parse JSON files, but it cannot parse this.
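For illustration, here's a minimal Python snippet (with made-up documents) showing why the object as a whole fails JSON parsing even though each individual line is valid:

    import json

    # Two newline-separated documents, the way Firehose writes them to S3.
    body = '{"name": "alice"}\n{"date": "2018-09-24"}\n'

    # json.loads(body) would raise JSONDecodeError ("Extra data"), because the
    # concatenation of two documents is not itself a valid JSON value.
    docs = [json.loads(line) for line in body.splitlines() if line.strip()]
    print(docs)  # [{'name': 'alice'}, {'date': '2018-09-24'}]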

I thought of writing a Lambda function to 'fix' the Firehose file, triggered by its creation in the bucket, but it sounds like a cheap workaround for two pieces that should fit together neatly.
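As a rough sketch of what I had in mind (the 'fixed/' prefix is just a placeholder), the Lambda would look something like this:

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Triggered by s3:ObjectCreated on the Firehose bucket.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            # Firehose writes one JSON document per line; parse each line separately.
            docs = [json.loads(line) for line in body.splitlines() if line.strip()]
            # Write back a single valid JSON array under a separate prefix.
            s3.put_object(Bucket=bucket, Key="fixed/" + key, Body=json.dumps(docs))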

Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
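I imagine that would look something along these lines (purely a sketch; the bucket and prefix are placeholders):

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the Firehose prefix directly; Spark's JSON reader expects one
    # document per line, which is exactly what Firehose writes out.
    events = glue_context.spark_session.read.json("s3://my-firehose-bucket/events/")

    # Documents with different fields are merged into one schema;
    # missing fields come through as null.
    events.printSchema()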

Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?

cheers and thanks in advance

asked Sep 24 '18 by pedrogfp


1 Answer

It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to some buffering (time and size) settings, and then write the records to S3 as a single object (see Firehose Data Delivery).

The batching of multiple records into a single file is important if the workload will contain a large number of records, as processing many small files from S3 hurts performance and drives up S3 costs.
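If you want to tune how much data lands in each object, that is controlled on the delivery stream itself. A minimal boto3 sketch (stream name, role ARN and bucket ARN are placeholders):

    import boto3

    firehose = boto3.client("firehose")
    firehose.create_delivery_stream(
        DeliveryStreamName="web-events",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",
            # Firehose flushes a new S3 object when either threshold is reached.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        },
    )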

AWS Glue crawlers and ETL jobs do support the 'JSON lines' (newline-delimited JSON) format.
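For example, once a crawler has catalogued the Firehose prefix, an ETL job can read it as a DynamicFrame along these lines (the database and table names are assumptions):

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the crawler-created table; records that lack a field
    # simply have null for it.
    events = glue_context.create_dynamic_frame.from_catalog(
        database="firehose_db",
        table_name="events",
    )
    events.printSchema()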

If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).

I have seen a crawler fail in a case where differences between the files being crawled forced it into a table-per-file mode and it hit the limit on the number of tables (see AWS Glue Limits).
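If that's what is happening here, the crawler can be told to combine compatible schemas into a single table via its Configuration. A sketch with boto3 (crawler name, role, database and path are placeholders):

    import json
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(
        Name="firehose-events-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="firehose_db",
        Targets={"S3Targets": [{"Path": "s3://my-firehose-bucket/events/"}]},
        # Ask the crawler to create one table for the whole path instead of
        # one table per file when the schemas are compatible.
        Configuration=json.dumps(
            {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
        ),
    )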

answered Nov 15 '22 by Kyle