 

What is the difference between --split-by and --boundary-query in SQOOP?

Assuming we don't have a column where values are equally distributed, let's say we have a command like this:

sqoop import \
... \
--boundary-query "SELECT min(id), max(id) from some_table" \
--split-by id \
...

What's the point of using --boundary-query here when --split-by does the same thing? Is there any other way to use --boundary-query? Or any other way to split the data more efficiently when there is no unique-key column?

asked Nov 28 '16 by burakongun

People also ask

What is split by in Sqoop?

--split-by : It is used to specify the column of the table used to generate splits for imports. This means that it specifies which column will be used to create the split while importing the data into your cluster. It can be used to enhance the import performance by achieving greater parallelism.

What is boundary query in Sqoop?

Apache Sqoop finds boundaries for creating splits by using a select query with the minimum and maximum values for splitting. Such sqoop operations are called “Custom boundary queries” or boundary value queries.

Why we use $conditions in Sqoop?

Sqoop performs highly efficient data transfers by inheriting Hadoop's parallelism. To help Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the where clause of your query.
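A minimal free-form query import might look like the sketch below (the JDBC URL, credentials, column list, and target directory are placeholders for illustration, not taken from the original post). Note the single quotes around the query, which keep the shell from expanding $CONDITIONS before Sqoop sees it:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --query 'SELECT id, name FROM some_table WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /data/some_table \
  -m 4
```

At runtime, each mapper substitutes its own split predicate for $CONDITIONS, e.g. `id >= 1 AND id < 51`.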


1 Answer

--split-by id will split your data uniformly across the number of mappers (default 4).

By default, the boundary query is something like this:

--boundary-query "SELECT min(id), max(id) from some_table"

But if you already know that id starts at val1 and ends at val2, there is no point in computing min() and max() at all. Skipping that scan makes the sqoop command execute faster.

You can specify any arbitrary query returning val1 and val2.
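For instance, if the ids are known to span 1 to 100000 (bounds chosen here purely for illustration, as are the connection details), you could hard-code the boundaries and avoid the min/max scan entirely:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table some_table \
  --split-by id \
  --boundary-query "SELECT 1, 100000" \
  -m 4
```

The boundary query only needs to return two values, which Sqoop treats as the lower and upper bounds for splitting.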


Edit:

Right now (1.4.7), there is no way in Sqoop to specify uneven partitions for splitting.

For example, you have data like:

1,2,3,51,52,191,192,193,194,195,196,197,198,199,200

If you define 4 mappers in the command, Sqoop will check the min and max, which are 1 and 200 in our case.

Then it will split it into 4 parts:

1-50
51-100
101-150
151-200

Yes, in this case the 3rd mapper (101-150) will get nothing from the RDBMS table.
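The split arithmetic above can be sketched in a few lines. This is a simplified illustration of how an integer --split-by column is divided into uniform ranges, not Sqoop's actual code:

```python
def uniform_splits(lo, hi, num_mappers):
    """Divide [lo, hi] into num_mappers contiguous ranges, the way
    Sqoop splits an integer --split-by column (simplified sketch)."""
    size = (hi - lo + 1) / num_mappers
    splits = []
    start = lo
    for i in range(num_mappers):
        # Last range is clamped to hi so no ids are dropped.
        end = hi if i == num_mappers - 1 else lo + round(size * (i + 1)) - 1
        splits.append((start, end))
        start = end + 1
    return splits

# The example data from the answer above.
data = [1, 2, 3, 51, 52, 191, 192, 193, 194, 195,
        196, 197, 198, 199, 200]
splits = uniform_splits(1, 200, 4)
print(splits)  # [(1, 50), (51, 100), (101, 150), (151, 200)]
# Rows each mapper would actually fetch:
print([sum(lo <= x <= hi for x in data) for lo, hi in splits])  # [3, 2, 0, 10]
```

The row counts make the skew concrete: the third mapper scans an empty range while the fourth does most of the work.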

But there is no way to define custom partitions like:

1-10
51-60
190-200

For large data (billions of rows), it is impractical to find exact values like this, or to first run another tool to discover the data distribution and then prepare custom partitions.

answered Oct 03 '22 by Dev