 

How to stop hive/pig install in Amazon Data Pipeline?

I don't need Hive or Pig, but Amazon Data Pipeline installs them by default on every EMR cluster it spins up. This makes testing take longer than it should. Any ideas on how to disable the install?

anvitron asked Jan 17 '14 18:01



1 Answer

This is not possible as of today.

The only workaround is to launch a small EMR cluster that you use for testing (for example, a single m1.small master node) and then target it with the 'workerGroup' field rather than 'runsOn'.

Depending on the type of activity you want to use, the 'workerGroup' field may or may not be supported. However, you can always wrap everything in a script (Python, shell, or anything else) and run it with ShellCommandActivity.
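
As a rough sketch of the workaround above, the following Python snippet builds a pipeline definition fragment in which a ShellCommandActivity is bound to a worker group instead of a 'runsOn' resource. The worker group name, activity id, and script path are hypothetical placeholders, and the schedule, roles, and other objects a complete definition needs are left out. It also assumes that Task Runner is already running on the test cluster and polling the same worker group name.

```python
import json


def build_definition(worker_group, command):
    """Return a partial pipeline definition with one ShellCommandActivity.

    The activity uses 'workerGroup' rather than 'runsOn', so Data Pipeline
    does not provision a new EMR cluster for it.
    """
    activity = {
        "id": "RunTestScript",          # hypothetical activity id
        "name": "RunTestScript",
        "type": "ShellCommandActivity",
        "command": command,
        "workerGroup": worker_group,    # must match the name Task Runner polls
    }
    return {"objects": [activity]}


# Hypothetical worker group name and script path, for illustration only.
definition = build_definition(
    "test-emr-workers",
    "python /home/hadoop/run_job.py",
)

# Print the JSON so it can be merged into a full pipeline definition file.
print(json.dumps(definition, indent=2))
```

The printed JSON is only the activity portion; it would still need to be combined with the rest of your pipeline definition before uploading.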


Update (thanks to ChristopherB for the reminder):

Starting with AMI version 3.x, Hive and Pig are bundled in the AMI itself. The install steps therefore do not pull any new packages from S3; they only activate the daemons on the master node. Unless you are worried about them consuming instance resources (CPU, memory, etc.), this should be fine. The steps should not take a noticeable amount of time to run.

panther answered Oct 30 '22 06:10
