
Amazon Elastic MapReduce best practices for log analysis

I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregating statistics for each delivered file by date / referrer / user agent.

Tons of logs are generated every hour, and that volume is likely to increase dramatically in the near future, so processing this kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.

Right now I have my mappers and reducers ready, and I have tested the whole process with the following flow (a rough automation sketch follows the list):

  • uploaded mappers, reducers, and data to Amazon S3
  • configured an appropriate job and ran it successfully
  • downloaded the aggregated results from Amazon S3 to my server and inserted them into a MySQL database with a CLI script
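
Here is a rough sketch of how that same flow might be driven from the AWS SDK for Python (boto3), which is the direction I'm considering for automation. The bucket name, key prefixes, instance types, and IAM role names below are placeholders, not a working setup:

    import boto3

    s3 = boto3.client("s3")
    emr = boto3.client("emr", region_name="us-east-1")

    # 1. Upload mapper, reducer, and raw logs to S3 (placeholder names).
    for local_path, key in [
        ("mapper.py", "scripts/mapper.py"),
        ("reducer.py", "scripts/reducer.py"),
        ("access.log", "input/2012-03-23/access.log"),
    ]:
        s3.upload_file(local_path, "my-log-bucket", key)

    # 2. Launch a cluster with a single Hadoop streaming step; the cluster
    #    shuts itself down once the step finishes.
    response = emr.run_job_flow(
        Name="log-aggregation",
        ReleaseLabel="emr-6.15.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "aggregate-access-logs",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://my-log-bucket/scripts/mapper.py,"
                              "s3://my-log-bucket/scripts/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://my-log-bucket/input/2012-03-23/",
                    "-output", "s3://my-log-bucket/output/2012-03-23/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])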

I've done all of that manually, following the many tutorials about Amazon EMR that can be found online.

What should I do next? What is the best approach to automating this process?

  • Should I control the Amazon EMR jobTracker via the API?
  • How can I make sure my logs will not be processed twice?
  • What is the best way to move processed files to an archive?
  • What is the best approach for inserting results into PostgreSQL/MySQL?
  • How should data for the jobs be laid out in the input/output directories?
  • Should I create a new EMR job each time through the API?
  • What is the best approach for uploading raw logs to Amazon S3?
  • Can anyone share their setup of the data processing flow?
  • How should I track file uploads and job completions?

I think this topic can be useful for the many people who try to process access logs with Amazon Elastic MapReduce but haven't been able to find good materials and/or best practices.

UPD: Just to clarify, here is the single final question:

What are the best practices for log processing powered by Amazon Elastic MapReduce?

Related posts:

Getting data in and out of Elastic MapReduce HDFS

asked Mar 23 '12 by webdevbyjoss


1 Answer

That's a very broad question, but here are some thoughts you could consider:

  • Using Amazon SQS: this is a distributed queue and is very useful for workflow management. You can have one process that writes to the queue as soon as a log is available, and another that reads from it, processes the log described in the queue message, and deletes the message when it's done processing. This would ensure that logs are processed only once (see the sketch after this list).
  • Apache Flume, as you mentioned, is very useful for log aggregation. This is something you should consider even if you don't need real-time processing, as it gives you at the very least a standardized aggregation process.
  • Amazon recently released Simple Workflow (SWF). I have just started looking into it, but it sounds promising for managing every step of your data pipeline.
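
To make the SQS idea concrete, here is a minimal consumer sketch with boto3. The queue URL and the process_log() helper are hypothetical; the important detail is that a message is deleted only after processing succeeds, so a log whose worker crashed becomes visible again after the visibility timeout instead of being lost:

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-jobs"

    def process_log(s3_key):
        # Placeholder for the real work, e.g. kicking off an EMR step
        # for the log file sitting at this S3 key.
        print("processing", s3_key)

    while True:
        # Long-poll so the loop doesn't hammer the API when the queue is empty.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])  # e.g. {"key": "input/2012-03-23/access.log"}
            process_log(body["key"])
            # Delete only after successful processing; until then the message
            # stays in the queue (hidden by the visibility timeout).
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )

The producer side is just a matching sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"key": key})) whenever a new log lands in S3. One caveat: standard SQS queues deliver at least once, so the processing step should be idempotent if strictly-once behavior matters.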

Hope that gives you some clues.

answered Sep 22 '22 by Charles Menguy