I don't need Hive or Pig, but Amazon Data Pipeline installs them by default on any EMR cluster it spins up. This makes testing take longer than it should. Any ideas on how to disable the install?
This is not possible as of today.
The only workaround is to launch a small EMR cluster yourself and use it for testing (for example, a single m1.small master node). Then point your activities at it with 'workerGroup' rather than 'runsOn'.
Depending on the type of activities you want to use, the 'workerGroup' field might or might not be supported. But you can always wrap everything in a script (Python, shell, or whatever) and run it with ShellCommandActivity.
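As a rough sketch of what that looks like, the snippet below builds the ShellCommandActivity part of a pipeline definition in Python and prints it as JSON. The activity id, the script path, and the worker group name "my-test-emr" are made-up placeholders, not anything from your setup. It also assumes Task Runner is already running on the test cluster and polling that same worker group. A real definition would need the other objects (schedule, defaults, and so on) as well.

```python
import json


def shell_activity(activity_id, command, worker_group):
    """Build a ShellCommandActivity object that targets a worker group.

    No 'runsOn' field is set, so Data Pipeline is not asked to
    provision a cluster for this activity.
    """
    return {
        "id": activity_id,
        "name": activity_id,
        "type": "ShellCommandActivity",
        "command": command,
        "workerGroup": worker_group,
    }


# Hypothetical values: replace with your own script and worker group name.
definition = {
    "objects": [
        shell_activity(
            "RunTestJob",
            "bash /home/hadoop/run_test.sh",
            "my-test-emr",
        ),
    ]
}

print(json.dumps(definition, indent=2))
```

The only thing that matters here is that the activity carries 'workerGroup' and no 'runsOn'; everything the script does is up to you.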
Update (correctly reminded by ChristopherB):
From AMI version 3.x onward, Hive and Pig are bundled in the AMI itself. So the install steps do not pull any new packages from S3; they only activate the daemons on the master node. Unless you are worried about them consuming your instance resources (CPU, memory, etc.), it should be okay. They should not take noticeable time to run.