Assume we don't have a column whose values are evenly distributed, and say we have a command like this:
sqoop import \
...
--boundary-query "SELECT min(id), max(id) from some_table" \
--split-by id \
...
What's the point of using --boundary-query here when --split-by does the same thing? Is there any other way to use --boundary-query? Or is there any other way to split data more efficiently when there is no unique (key) column?
--split-by: specifies the table column used to generate the splits for an import, i.e. which column Sqoop uses to divide the data into chunks while importing it into your cluster. Picking a well-distributed column can improve import performance by achieving greater parallelism.
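For example, a minimal sketch (the connection string, credentials, and mapper count are placeholders, not from the question):

# Hypothetical connection details; 8 mappers each import one range of id
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --table some_table \
  --split-by id \
  --num-mappers 8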
Apache Sqoop finds the boundaries for the splits by issuing a select query that returns the minimum and maximum values of the split column. Overriding that query is what Sqoop calls a “custom boundary query” (or boundary value query).
Sqoop performs highly efficient data transfers by inheriting Hadoop's parallelism. To let Sqoop split a free-form query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the WHERE clause of your query.
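As an illustration, a sketch of a free-form query import (connection details and column names are placeholders); the single quotes keep the shell from expanding $CONDITIONS, which Sqoop replaces with each mapper's range predicate:

# Free-form query import; --query requires --target-dir and --split-by
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --query 'SELECT id, name FROM some_table WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /user/me/some_table \
  --num-mappers 4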
--split-by id will split your data uniformly based on the number of mappers (4 by default).
By default, the boundary query is something like this:
--boundary-query "SELECT min(id), max(id) from some_table"
But if you already know that id starts at val1 and ends at val2, there is no point in computing min() and max() on the database. Skipping those operations makes the sqoop command execute faster. You can specify any arbitrary query that returns val1 and val2 as the boundaries.
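For instance, assuming a MySQL source (where SELECT without a FROM clause is valid) and ids known to span 1 to 200, a sketch like this skips the min/max scan entirely:

# Hard-coded boundaries; Sqoop uses 1 and 200 without querying the table
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --table some_table \
  --boundary-query "SELECT 1, 200" \
  --split-by id \
  --num-mappers 4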
Edit:
As of Sqoop 1.4.7, there is no way to specify uneven partitions for splitting.
For example, suppose you have id values like:
1,2,3,51,52,191,192,193,194,195,196,197,198,199,200
If you define 4 mappers in the command, Sqoop will check the min and max, which are 1 and 200 in our case, and then split the range into 4 parts:
1-50
51-100
101-150
151-200
Note that the 3rd mapper (101-150) will fetch nothing from the RDBMS table.
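Roughly, the four mappers issue queries like these (a sketch; the exact boundaries come from Sqoop's integer splitter, but they correspond to the ranges above):

-- mapper 1
SELECT * FROM some_table WHERE id >= 1 AND id < 51
-- mapper 2
SELECT * FROM some_table WHERE id >= 51 AND id < 101
-- mapper 3 (matches no rows in this data set)
SELECT * FROM some_table WHERE id >= 101 AND id < 151
-- mapper 4 (the last split includes the upper bound)
SELECT * FROM some_table WHERE id >= 151 AND id <= 200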
But there is no way to define custom partitions like:
1-10
51-60
190-200
For large data (billions of rows), it is impractical to find such exact values, or to use another tool to profile the data first and then prepare custom partitions.