Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How AWS Athena deals with single line JSONs?

I am currently using Athena along with Kinesis Firehose, Glue Crawler. Kinesis Firehose is saving JSON to single line files as below

{"name": "Jone Doe"}{"name": "Jane Doe"}{"name": "Jack Doe"}

But I noticed that the athena query select count(*) from db.names returns 1 instead of 3. After searching for the problem. I found the following document.

https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/?nc1=h_ls

The article says that JSON files files should stored with new lines.

{"name": "Jone Doe"}
{"name": "Jane Doe"}
{"name": "Jack Doe"}

Is there some smart tricks to run athena query on the single line JSON files?


Update

Thanks to @Constantine, AWS Athena is performing distributed processing. Since, single-line JSON files doesn't have seperator, It can't perform distributed processing. So, You must transform the files before saving it.

Kinesis Firehose offers transformation using Lambda, I added following transformation, in order to query data from AWS Athena.

​const addNewLine = (data) => {
   const parsedData = JSON.parse(new Buffer.from(data,'base64').toString('utf8'));
   return new Buffer.from(JSON.stringify(parsedData) + '\n').toString('base64')
}
​
exports.handler = async (event, context) => {
   const output = event.records.map((record) => ({
       recordId: record.recordId,
       result: 'Ok',
       data: addNewLine(record.data),
   }));
   return { records: output };
};​

I've come up with this code through following link AWS Firehose newline Character

like image 534
astrohsy Avatar asked Jun 07 '20 16:06

astrohsy


People also ask

Does Athena work on JSON?

Amazon Athena lets you parse JSON-encoded values, extract data from JSON, search for values, and find length and size of JSON arrays. To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Can Athena query unstructured data?

You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC.


1 Answers

I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.

However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and Kinesis Firehose Processing can be used in this case to add new line characters between JSON objects.

Each transformed record is supposed to contain recordId, result and Base64 encoded data with the transformed payload. There are multiple examples of such Lambda function, e.g. this python sample in Amazon AWS samples repos on GitHub.

like image 122
Constantine Avatar answered Sep 19 '22 21:09

Constantine