Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what are the following commands in sqoop?

Tags:

sqoop

Can anyone tell me what is the use of --split-by and boundary query in sqoop?

sqoop import --connect jdbc:mysql://localhost/my --username user --password 1234 --query 'select * from table where id=5 AND $CONDITIONS' --split-by table.id --target-dir /dir

like image 648
NJ_315 Avatar asked Jul 29 '13 11:07

NJ_315


People also ask

Which of the following command is used command is used to verify the Sqoop version?

Verify Job (--list) '--list' argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs. It shows the list of saved jobs.

Which command is used to run a Sqoop job?

Execute Job (--exec) '--exec' option is used to execute a saved job. The following command is used to execute a saved job called myjob.

What are the 2 main functions of Sqoop?

Sqoop has two main functions: importing and exporting. Importing transfers structured data into HDFS; exporting moves this data from Hadoop to external databases in the cloud or on-premises. Importing involves Sqoop assessing the external database's metadata before mapping it to Hadoop.

What will happen if we run the following command in Sqoop?

It can be used to enhance the import performance by achieving greater parallelism. Sqoop creates splits based on values in a particular column of the table which is specified by --split-by by the user through the import command. If it is not available, the primary key of the input table is used to create the splits.


1 Answers

Split by :

  1. why it is used? -> to enhance the speed while fetching the data from rdbms to hadoop
  2. How it works? -> By default there are 4 mappers in sqoop , so the import works parallely. The entire data is divided into equal partitions. Sqoop considers primary key column for splitting the data and then finds out the maximum and minimum range from it and then makes the 4 ranges for 4 mappers to work. Eg. 1000 records in primary key column and max value =1000 and min value -0 so sqoop will create 4 ranges - (0-250) , (250-500),(500-750),(750-1000) and depending on values of column the data will be partitioned and given to 4 mappers to store it on HDFS. so if in case the primary key column is not evenly distributed so with split-by you can change the column-name for evenly partitioning.

In short: Used for partitioning of data to support parallelism and improve performance

like image 117
Tutu Kumari Avatar answered Sep 20 '22 20:09

Tutu Kumari