
Query S3 logs content using Athena or DynamoDB

I have a use case to query request url from S3 logs. Amazon has recently introduced Athena to query S3 file contents. What is the best option with respect to cost and performance?

  1. Use Athena to query S3 files for request urls
  2. Store metadata of each file with request url information in DynamoDB table for query
Ashan asked Dec 23 '22

1 Answer

Amazon DynamoDB would be a poor choice for running queries over web logs.

DynamoDB is super-fast, but only if you are retrieving data based upon its Primary Key (a "Query" operation). If you are searching across ALL data in a table (e.g. to find a particular IP address in an attribute that is not indexed), DynamoDB must read every row in the table (a "Scan" operation), which takes a long time. For example, if your table is provisioned for 100 reads per second and you scan 10,000 rows, the scan will take 100 seconds (10,000 rows ÷ 100 reads per second).

Tip: Do not do full-table scans in a NoSQL database.
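The scan-time arithmetic above can be sketched as a back-of-envelope estimate. Note this is a simplification (actual capacity consumption in DynamoDB depends on item size and consistency mode, not just row count), but it shows why scan time grows linearly with table size:

```python
# Rough estimate of how long a full DynamoDB table scan takes when
# throttled by provisioned read capacity. Simplified: it treats one
# read per row, ignoring item size and read-consistency multipliers.

def scan_seconds(total_rows: int, reads_per_second: int) -> float:
    """Seconds needed to scan `total_rows` at `reads_per_second`."""
    return total_rows / reads_per_second

# The example from the answer: 10,000 rows at 100 reads per second.
print(scan_seconds(10_000, 100))  # 100 seconds
```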

Amazon Athena is ideal for scanning log files! There is no need to pre-load data: simply run the query against the logs already stored in Amazon S3, using standard SQL to find the data you're seeking. Plus, you only pay for the data that is read from disk. The S3 access log format is not a standard delimited format, so you'll need a CREATE TABLE statement with the correct SerDe to parse it.

See: Using AWS Athena to query S3 Server Access Logs
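As a rough sketch, a request-URL query against such a table could be built and submitted with boto3. The table name (`s3_access_logs`) and column names (`requestdatetime`, `remoteip`, `requesturi`) here are assumptions; they depend on the CREATE TABLE statement you ran:

```python
# Hypothetical sketch: query S3 server access logs through Athena.
# Table and column names are assumptions based on a typical access-log
# table definition -- adjust them to match your own CREATE TABLE.

def build_request_url_query(table: str, url_fragment: str) -> str:
    """Build a SQL query finding log rows whose request URI contains a fragment."""
    return (
        f"SELECT requestdatetime, remoteip, requesturi "
        f"FROM {table} "
        f"WHERE requesturi LIKE '%{url_fragment}%' "
        f"LIMIT 100"
    )

sql = build_request_url_query("s3_access_logs", "/images/")
print(sql)

# Submitting it requires AWS credentials and an S3 output location, e.g.:
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=sql,
#     QueryExecutionContext={"Database": "default"},
#     ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
# )
```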

Another choice is to use Amazon Redshift, which can handle GBs, TBs and even PBs of data across billions of rows. If you are going to run frequent queries against the log data, Redshift is great. However, being a standard SQL database, you will need to pre-load the data into Redshift. Unfortunately, Amazon S3 log files are not in CSV format, so you would need to ETL the files into a suitable format first. This isn't worthwhile for occasional, ad-hoc requests.
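To illustrate that ETL step, here is a minimal sketch that converts one space-delimited S3 access log line into a CSV row. The sample line and the regex are simplified approximations of the real format, which contains more fields:

```python
import csv
import io
import re

# Simplified sketch of the ETL step: extract a few fields from one
# S3 access log line and emit them as CSV. The real format has many
# more fields; this regex only pulls out timestamp, IP, request, status.
LOG_RE = re.compile(
    r'\[(?P<time>[^\]]+)\]\s+'   # [timestamp]
    r'(?P<ip>\S+)\s+.*?'         # remote IP, then skip ahead
    r'"(?P<request>[^"]*)"\s+'   # "METHOD /uri HTTP/1.1"
    r'(?P<status>\d{3})'         # HTTP status code
)

# Illustrative sample line, loosely modelled on the access-log layout.
line = ('owner bucket [06/Feb/2023:00:00:38 +0000] 192.0.2.3 requester reqid '
        'REST.GET.OBJECT key "GET /bucket/images/cat.jpg HTTP/1.1" 200 - 1024')

m = LOG_RE.search(line)
buf = io.StringIO()
csv.writer(buf).writerow([m["time"], m["ip"], m["request"], m["status"]])
print(buf.getvalue().strip())
```

In a real pipeline you would stream every log object from S3 through a parser like this, write the CSV output back to S3, and then load it into Redshift with a COPY command.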

Many people also like to use Amazon Elasticsearch Service for scanning log files. Again, the file format needs some special handling and the pipeline to load the data needs some work, but the result is near-realtime interactive analysis of your S3 log files.

See: Using the ELK stack to analyze your S3 logs

John Rotenstein answered Jan 08 '23