In Amazon data pipeline, I am creating activity to copy S3 to EMR using Hive. To achieve it I have to pass two input parameters into EMR job as a step. I have searched all most every data pipeline documentation but did not found the way to specify the multiple input parameters. I also talk with the AWS support team but they are also not clear about it. The way/trick they suggested also not working.
Below is my step arguments and Hive query. Please let me know if anyone has idea to achieve it.
Steps:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -d, "output1=#{output.directoryPath}", -d,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(@scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -d,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(@scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
Hive Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
I was able to get this to work using the following format on step field of EMRActivity:
Basically I changed -d with -hiveconf. Also changed substitution in hive script from to. I think this is a change made on newer version of hive.
Below is the changed working code:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -hiveconf, "output1=#{output.directoryPath}", -hiveconf,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(@scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -hiveconf,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(@scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
HIVE Query:
table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
Hope this helps to someone.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With