I'm parsing access logs generated by Apache, Nginx, Darwin (video streaming server) and aggregating statistics for each delivered file by date / referrer / useragent. Tons of logs generated every hour and that number likely to be increased dramatically in near future - so processing that kind of data in distributed manner via Amazon Elastic MapReduce sounds reasonable. Right now I'm ready with mappers and reducers to process my data and tested the whole process with the following flow: <ul> <li>uploaded mappers, reducers and data to Amazon S3</li> <li>configured appropriate job and processed it successfully</li> <li>downloaded aggregated results from Amazon S3 to my server and inserted them into MySQL database by running CLI script</li> </ul> I've done that manually according to thousands of tutorials that are googlable on the Internet about Amazon ERM. What should I do next? What is a best approach to automate this process? <ul> <li>Should I control Amazon EMR jobTracker via API?</li> <li>How can I make sure my logs will not be processed twice?</li> <li>What is the best way to move processed files to archive?</li> <li>What is the best approach to insert results into PostgreSQL/MySQL?</li> <li>How data for the jobs should be laid out in input/output directories?</li> <li>Should I create a new EMR job each time using the API?</li> <li>What is the best approach to upload raw logs to Amazon S3?</li> <li>Can anyone share their setup of the data processing flow?</li> <li>How to control file uploads and jobs completions?</li> </ul> I think that this topic can be useful for many people who try to process access logs with Amazon Elastic MapReduce but were not able to find good materials and/or best practices. UPD: Just to clarify here is the single final question: What are best practices for logs processing powered by Amazon Elastic MapReduce? Related posts: Getting data in and out of Elastic MapReduce HDFS

That's a very very wide open question, but here are some thoughts you could consider: <ul> <li>Using Amazon SQS: this is a distributed queue, and is very useful for workflow management, you cna have a process that writes to the queue as soon as a log is available, and another who reads from it, processes the log described in the queue message, and deletes it when it's done processing. This would ensure that logs are processed only once.</li> <li>Apache Flume as you mentionned is very useful for log aggregation. This is something you should consider, even if you don't need real-time, as it gives you at the very least a standardized aggregation process.</li> <li>Amazon recently release SimpleWorkFlow. I have just started looking into it, but that sounds promising to manage every step of your data pipeline.</li> </ul> Hope that gives you some clues.

Amazon MapReduce best practices for logs analysis

Tags:

logging

amazon-s3

hadoop

amazon-emr

hadoop-streaming

I'm parsing access logs generated by Apache, Nginx, Darwin (video streaming server) and aggregating statistics for each delivered file by date / referrer / useragent.

Tons of logs generated every hour and that number likely to be increased dramatically in near future - so processing that kind of data in distributed manner via Amazon Elastic MapReduce sounds reasonable.

Right now I'm ready with mappers and reducers to process my data and tested the whole process with the following flow:

uploaded mappers, reducers and data to Amazon S3
configured appropriate job and processed it successfully
downloaded aggregated results from Amazon S3 to my server and inserted them into MySQL database by running CLI script

I've done that manually according to thousands of tutorials that are googlable on the Internet about Amazon ERM.

What should I do next? What is a best approach to automate this process?

Should I control Amazon EMR jobTracker via API?
How can I make sure my logs will not be processed twice?
What is the best way to move processed files to archive?
What is the best approach to insert results into PostgreSQL/MySQL?
How data for the jobs should be laid out in input/output directories?
Should I create a new EMR job each time using the API?
What is the best approach to upload raw logs to Amazon S3?
Can anyone share their setup of the data processing flow?
How to control file uploads and jobs completions?

I think that this topic can be useful for many people who try to process access logs with Amazon Elastic MapReduce but were not able to find good materials and/or best practices.

UPD: Just to clarify here is the single final question:

What are best practices for logs processing powered by Amazon Elastic MapReduce?

Getting data in and out of Elastic MapReduce HDFS

709

asked Mar 23 '12 11:03

webdevbyjoss

1 Answers

That's a very very wide open question, but here are some thoughts you could consider:

Using Amazon SQS: this is a distributed queue, and is very useful for workflow management, you cna have a process that writes to the queue as soon as a log is available, and another who reads from it, processes the log described in the queue message, and deletes it when it's done processing. This would ensure that logs are processed only once.
Apache Flume as you mentionned is very useful for log aggregation. This is something you should consider, even if you don't need real-time, as it gives you at the very least a standardized aggregation process.
Amazon recently release SimpleWorkFlow. I have just started looking into it, but that sounds promising to manage every step of your data pipeline.

Hope that gives you some clues.

answered Sep 22 '22 19:09

Charles Menguy

Related questions
                            
                                Accessing files in hadoop distributed cache
                            
                                Hive Job failed with return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask and Query Performance
                            
                                Spark SQL unable to complete writing Parquet data with a large number of shards
                            
                                hadoop Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit
                            
                                Spark driver disassociated and removed by the master
                            
                                Using hive table over parquet in Pig
                            
                                TIMESTAMP format issue in HIVE
                            
                                Spark: saveAsTextFile() only creating SUCCESS file and no part file when writing to local filesystem
                            
                                Unable to load libhdfs when using pyarrow
                            
                                Reading data from S3 using pyspark throws java.lang.NumberFormatException: For input string: "100M"
                            
                                WARN snappy.LoadSnappy: Snappy native library not loaded
                            
                                Saving garbage collection logs into ${yarn.nodemanager.log-dirs}/application_${appid}/container_${contid} for mappers and reducers on Hadoop Yarn

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With