Assume we don't have a column whose values are evenly distributed, and say we have a command like this:
sqoop import \
...
--boundary-query "SELECT min(id), max(id) from some_table" \
--split-by id \
...
What's the point of using --boundary-query here when --split-by does the same thing? Is there any other way to use --boundary-query? Or is there any other way to split data more efficiently when there is no unique (key) column?
--split-by: specifies the table column used to generate the splits for an import, i.e. which column Sqoop uses to divide the data into chunks while importing it into your cluster. Picking a well-distributed column can improve import performance by achieving greater parallelism.
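For example, a minimal sketch (the connection string, credentials, and mapper count are placeholders, not from the question):

# Hypothetical connection details; 8 mappers each import one range of id
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --table some_table \
  --split-by id \
  --num-mappers 8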
Apache Sqoop finds the boundaries for the splits by issuing a select query that returns the minimum and maximum values of the split column. Overriding that query is what Sqoop calls a “custom boundary query” (or boundary value query).
Sqoop performs highly efficient data transfers by inheriting Hadoop's parallelism. To let Sqoop split a free-form query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the WHERE clause of your query.
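As an illustration, a sketch of a free-form query import (connection details and column names are placeholders); the single quotes keep the shell from expanding $CONDITIONS, which Sqoop replaces with each mapper's range predicate:

# Free-form query import; --query requires --target-dir and --split-by
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --query 'SELECT id, name FROM some_table WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /user/me/some_table \
  --num-mappers 4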
--split-by id will split your data uniformly based on the number of mappers (4 by default).
By default, the boundary query is something like this:
--boundary-query "SELECT min(id), max(id) from some_table"
But if you already know that id starts at val1 and ends at val2, there is no point in computing min() and max() on the database. Skipping those operations makes the sqoop command execute faster. You can specify any arbitrary query that returns val1 and val2 as the boundaries.
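For instance, assuming a MySQL source (where SELECT without a FROM clause is valid) and ids known to span 1 to 200, a sketch like this skips the min/max scan entirely:

# Hard-coded boundaries; Sqoop uses 1 and 200 without querying the table
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --table some_table \
  --boundary-query "SELECT 1, 200" \
  --split-by id \
  --num-mappers 4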
Edit:
As of Sqoop 1.4.7, there is no way to specify uneven partitions for splitting.
For example, suppose you have id values like:
1,2,3,51,52,191,192,193,194,195,196,197,198,199,200
If you define 4 mappers in the command, Sqoop will check the min and max, which are 1 and 200 in our case, and then split the range into 4 parts:
1-50
51-100
101-150
151-200
Note that the 3rd mapper (101-150) will fetch nothing from the RDBMS table.
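Roughly, the four mappers issue queries like these (a sketch; the exact boundaries come from Sqoop's integer splitter, but they correspond to the ranges above):

-- mapper 1
SELECT * FROM some_table WHERE id >= 1 AND id < 51
-- mapper 2
SELECT * FROM some_table WHERE id >= 51 AND id < 101
-- mapper 3 (matches no rows in this data set)
SELECT * FROM some_table WHERE id >= 101 AND id < 151
-- mapper 4 (the last split includes the upper bound)
SELECT * FROM some_table WHERE id >= 151 AND id <= 200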
But there is no way to define custom partitions like:
1-10
51-60
190-200
For large data (billions of rows), it is impractical to find such exact values, or to use another tool to profile the data first and then prepare custom partitions.