I am currently using Athena along with Kinesis Firehose and a Glue Crawler. Kinesis Firehose is saving JSON to single-line files, as below:
{"name": "Jone Doe"}{"name": "Jane Doe"}{"name": "Jack Doe"}
But I noticed that the Athena query select count(*) from db.names returns 1 instead of 3. After searching for the problem, I found the following document:
https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/?nc1=h_ls
The article says that JSON records should be stored on separate lines:
{"name": "Jone Doe"}
{"name": "Jane Doe"}
{"name": "Jack Doe"}
Are there any smart tricks to run Athena queries on these single-line JSON files?
Thanks to @Constantine: AWS Athena performs distributed processing, and since single-line JSON files have no separator, it cannot split up the work. So you must transform the files before saving them.
Kinesis Firehose offers transformation using Lambda, so I added the following transformation in order to query the data from AWS Athena:
// Decode the Base64 payload, re-serialize the JSON, and append a newline.
const addNewLine = (data) => {
  const parsedData = JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
  return Buffer.from(JSON.stringify(parsedData) + '\n').toString('base64');
};

exports.handler = async (event, context) => {
  const output = event.records.map((record) => ({
    recordId: record.recordId,
    result: 'Ok',
    data: addNewLine(record.data),
  }));
  return { records: output };
};
I came up with this code via the following link: AWS Firehose newline Character.
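For reference, here is a quick local check of how the transform behaves on one record (a minimal Node.js sketch; addNewLine is restated so the snippet runs on its own, and the sample payload is illustrative):

```javascript
// Re-serialize a Base64-encoded JSON payload with a trailing newline,
// mirroring the Lambda transform in the answer above.
const addNewLine = (data) => {
  const parsed = JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
  return Buffer.from(JSON.stringify(parsed) + '\n').toString('base64');
};

// Simulate one Firehose record payload: Base64-encoded single-line JSON.
const incoming = Buffer.from('{"name": "Jone Doe"}').toString('base64');
const decoded = Buffer.from(addNewLine(incoming), 'base64').toString('utf8');
console.log(decoded.endsWith('\n')); // true
```

With this in place, each object lands in S3 on its own line, which is the format the AWS article above asks for.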
Amazon Athena lets you parse JSON-encoded values, extract data from JSON, search for values, and find the length and size of JSON arrays.
You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC.
I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.
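The separator point can be illustrated with a small Node.js sketch (the sample strings are made up): newline-delimited JSON splits into records trivially, while concatenated objects cannot be split without a full JSON parser, since braces also occur inside values.

```javascript
// Newline-delimited JSON: each line is an independent, parseable record.
const ndjson = '{"name": "Jone Doe"}\n{"name": "Jane Doe"}\n{"name": "Jack Doe"}';
const rows = ndjson.split('\n').map((line) => JSON.parse(line));
console.log(rows.length); // 3

// Concatenated JSON: no separator exists, and the whole string is not
// itself valid JSON, so a naive parse fails.
const concatenated = '{"name": "Jone Doe"}{"name": "Jane Doe"}';
let failed = false;
try {
  JSON.parse(concatenated);
} catch (e) {
  failed = true;
}
console.log(failed); // true
```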
However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and the Kinesis Firehose Processing blueprint can be used in this case to add newline characters between JSON objects. Each transformed record is supposed to contain recordId, result, and Base64-encoded data with the transformed payload.
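That contract can be sketched as follows (a minimal Node.js illustration, not the blueprint itself; the recordId and payload are made up):

```javascript
// Build the per-record response a Firehose transformation Lambda must return.
const transformRecord = (record) => ({
  recordId: record.recordId, // must echo the incoming recordId unchanged
  result: 'Ok',              // or 'Dropped' / 'ProcessingFailed'
  data: Buffer.from(
    Buffer.from(record.data, 'base64').toString('utf8') + '\n'
  ).toString('base64'),      // Base64-encoded transformed payload
});

const out = transformRecord({
  recordId: 'record-1', // hypothetical id for illustration
  data: Buffer.from('{"name": "Jane Doe"}').toString('base64'),
});
console.log(out.result); // Ok
```

Records marked 'Dropped' are skipped, and 'ProcessingFailed' records are retried or sent to the error output, so echoing the recordId correctly matters.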
There are multiple examples of such a Lambda function, e.g. this Python sample in the Amazon AWS samples repos on GitHub.