I am currently using Athena along with Kinesis Firehose and a Glue Crawler. Kinesis Firehose is saving JSON to single-line files, as below:
{"name": "Jone Doe"}{"name": "Jane Doe"}{"name": "Jack Doe"}
But I noticed that the Athena query select count(*) from db.names returns 1 instead of 3. After searching for the problem, I found the following document:
https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/?nc1=h_ls
The article says that JSON records should be stored on separate lines:
{"name": "Jone Doe"}
{"name": "Jane Doe"}
{"name": "Jack Doe"}
Are there any smart tricks to run Athena queries on these single-line JSON files?
Thanks to @Constantine: AWS Athena performs distributed processing, and since single-line JSON files have no separator, it cannot split up the work. So you must transform the files before saving them.
Kinesis Firehose offers transformation using Lambda, so I added the following transformation in order to query the data from AWS Athena:
// Decode the Base64 payload, re-serialize the JSON, and append a newline.
const addNewLine = (data) => {
  const parsedData = JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
  return Buffer.from(JSON.stringify(parsedData) + '\n').toString('base64');
};

exports.handler = async (event, context) => {
  const output = event.records.map((record) => ({
    recordId: record.recordId,
    result: 'Ok',
    data: addNewLine(record.data),
  }));
  return { records: output };
};
I came up with this code via the following link: AWS Firehose newline Character.
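For reference, here is a quick local check of how the transform behaves on one record (a minimal Node.js sketch; addNewLine is restated so the snippet runs on its own, and the sample payload is illustrative):

```javascript
// Re-serialize a Base64-encoded JSON payload with a trailing newline,
// mirroring the Lambda transform in the answer above.
const addNewLine = (data) => {
  const parsed = JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
  return Buffer.from(JSON.stringify(parsed) + '\n').toString('base64');
};

// Simulate one Firehose record payload: Base64-encoded single-line JSON.
const incoming = Buffer.from('{"name": "Jone Doe"}').toString('base64');
const decoded = Buffer.from(addNewLine(incoming), 'base64').toString('utf8');
console.log(decoded.endsWith('\n')); // true
```

With this in place, each object lands in S3 on its own line, which is the format the AWS article above asks for.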
Amazon Athena lets you parse JSON-encoded values, extract data from JSON, search for values, and find the length and size of JSON arrays.
You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC.
I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.
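The separator point can be illustrated with a small Node.js sketch (the sample strings are made up): newline-delimited JSON splits into records trivially, while concatenated objects cannot be split without a full JSON parser, since braces also occur inside values.

```javascript
// Newline-delimited JSON: each line is an independent, parseable record.
const ndjson = '{"name": "Jone Doe"}\n{"name": "Jane Doe"}\n{"name": "Jack Doe"}';
const rows = ndjson.split('\n').map((line) => JSON.parse(line));
console.log(rows.length); // 3

// Concatenated JSON: no separator exists, and the whole string is not
// itself valid JSON, so a naive parse fails.
const concatenated = '{"name": "Jone Doe"}{"name": "Jane Doe"}';
let failed = false;
try {
  JSON.parse(concatenated);
} catch (e) {
  failed = true;
}
console.log(failed); // true
```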
However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and the Kinesis Firehose Processing blueprint can be used in this case to add newline characters between JSON objects. Each transformed record is supposed to contain recordId, result, and Base64-encoded data with the transformed payload.
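That contract can be sketched as follows (a minimal Node.js illustration, not the blueprint itself; the recordId and payload are made up):

```javascript
// Build the per-record response a Firehose transformation Lambda must return.
const transformRecord = (record) => ({
  recordId: record.recordId, // must echo the incoming recordId unchanged
  result: 'Ok',              // or 'Dropped' / 'ProcessingFailed'
  data: Buffer.from(
    Buffer.from(record.data, 'base64').toString('utf8') + '\n'
  ).toString('base64'),      // Base64-encoded transformed payload
});

const out = transformRecord({
  recordId: 'record-1', // hypothetical id for illustration
  data: Buffer.from('{"name": "Jane Doe"}').toString('base64'),
});
console.log(out.result); // Ok
```

Records marked 'Dropped' are skipped, and 'ProcessingFailed' records are retried or sent to the error output, so echoing the recordId correctly matters.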
There are multiple examples of such a Lambda function, e.g. this Python sample in the Amazon AWS samples repos on GitHub.