
Spring Batch for massive nightly / hourly Hive / MySQL data processing

I'm looking into replacing a bunch of Python ETL scripts that perform nightly / hourly data summarization and statistics gathering over a massive amount of data.

What I'd like to achieve is:

  • Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead.
  • The framework must be able to recover from crashes. I guess some persistence would be needed here.
  • Monitoring - I need to be able to monitor the progress of jobs / steps, and preferably see history and statistics with regards to the performance.
  • Traceability - I must be able to understand the state of the executions
  • Manual intervention - nice to have... being able to start / stop / pause a job from an API / UI / command line.
  • Simplicity - I prefer not to get angry looks from my colleagues when I introduce the replacement... Having a simple and easy to understand API is a requirement.

The current scripts do the following:

  • Collect text logs from many machines, and push them into Hadoop DFS. We may use Flume for this step in the future (see http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/).
  • Perform Hive summary queries on the data, and insert (overwrite) the results into new Hive tables / partitions.
  • Extract the new summary data into files, and load (merge) it into MySQL tables. This data is needed later for on-line reports.
  • Perform additional joins between the newly added MySQL data and existing MySQL tables, and update the data.
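Roughly, the steps above can be sketched as a shell pipeline. All paths, table names, and queries below are hypothetical placeholders, not the actual scripts:

```shell
#!/usr/bin/env bash
set -euo pipefail   # fail fast so a wrapper can detect a failed step and restart it
DT=$(date +%Y-%m-%d)

# 1. Collect text logs and push them into HDFS (hypothetical paths)
hadoop fs -put /var/log/myapp/*.log "/data/raw/logs/dt=$DT/"

# 2. Summarize with Hive, overwriting a new partition (hypothetical tables)
hive -e "
  INSERT OVERWRITE TABLE log_summary PARTITION (dt='$DT')
  SELECT host, COUNT(*) AS hits
  FROM raw_logs WHERE dt='$DT'
  GROUP BY host;
"

# 3. Extract the summary to a file and merge it into MySQL
hive -e "SELECT host, hits FROM log_summary WHERE dt='$DT';" > /tmp/summary.tsv
mysql reports -e "
  LOAD DATA LOCAL INFILE '/tmp/summary.tsv'
  REPLACE INTO TABLE daily_summary
  FIELDS TERMINATED BY '\t';
"

# 4. Join the new rows against existing tables and update derived data
mysql reports -e "
  UPDATE daily_totals t
  JOIN daily_summary s ON s.host = t.host
  SET t.total = t.total + s.hits;
"
```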

My idea is to replace the scripts with Spring Batch. I also looked into Scriptella, but I believe it is too 'simple' for this case.

Since I've seen some negative feedback about Spring Batch (mostly in old posts), I'm hoping to get some input here. I also haven't found much about Spring Batch and Hive integration, which is troubling.
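For what it's worth, Spring Batch's restartability and manual-intervention features are reachable from the command line via its CommandLineJobRunner; job state is persisted in the JobRepository, so a crashed run can be resumed. The config file, job name, and parameter below are placeholders:

```shell
# Launch a job defined in jobs.xml; execution state is persisted in the JobRepository
java org.springframework.batch.core.launch.support.CommandLineJobRunner \
  jobs.xml nightlySummaryJob run.date=2010-08-16

# After a crash or failure, restart the last failed execution of the same job;
# steps that already completed are skipped and processing resumes at the failed step
java org.springframework.batch.core.launch.support.CommandLineJobRunner \
  jobs.xml nightlySummaryJob run.date=2010-08-16 -restart
```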

asked Aug 16 '10 by Eran Harel


2 Answers

If you want to stay within the Hadoop ecosystem, I'd highly recommend checking out Oozie to automate your workflow. We (Cloudera) provide a packaged version of Oozie that you can use to get started. See our recent blog post for more details.
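To give a feel for it: Oozie workflows are defined in XML and submitted, monitored, and killed from the command line, which covers the monitoring and manual-intervention requirements. The server URL and properties file here are placeholder assumptions:

```shell
# Submit and start a workflow job described by job.properties
# (prints a job ID such as 0000001-... on success)
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# Inspect the status and per-action state of a job using the printed ID
oozie job -oozie http://localhost:11000/oozie -info <job-id>

# Kill a running job if manual intervention is needed
oozie job -oozie http://localhost:11000/oozie -kill <job-id>
```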

answered Nov 06 '22 by Jeff Hammerbacher


Why not use JasperETL or Talend? They seem like the right tools for the job.

answered Nov 06 '22 by dukethrash