
Correct choice of chunks-specification for dask array

Tags: python, dask

According to the dask documentation it's possible to specify the chunks in one of three ways:

  • a blocksize like 1000
  • a blockshape like (1000, 1000)
  • explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400))

Your chunks input will be normalized and stored in the third and most explicit form.

After trying to understand the way chunks work using the visualize() function, there are still a few things I'm not sure about:

If the input is normalized, does it matter which input form I choose?

A blocksize means every chunk has size X, e.g. 1000. What does the blockshape input specify?

When giving a blockshape input, does the order of parameters make a difference? How is it related to the shape of the array/matrix?

istern asked Jan 20 '16 09:01


1 Answer

The forms lower in that list are more explicit and allow for greater asymmetry in your block shapes.
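One way to see how the three forms relate is to expand each of them into the explicit form by hand. The sketch below is a simplified illustration of that expansion, not dask's own normalization code; the function name normalize_chunks_sketch and the (2500, 800) shape are assumptions chosen to match the sizes quoted in the question.

```python
def normalize_chunks_sketch(chunks, shape):
    """Expand a chunks argument into explicit per-dimension block sizes.

    Simplified illustration only; dask's real normalization covers more cases.
    """
    # A bare integer (blocksize) applies the same size to every dimension.
    if isinstance(chunks, int):
        chunks = (chunks,) * len(shape)
    if len(chunks) != len(shape):
        raise ValueError("need one chunks entry per dimension")

    explicit = []
    for size, dim in zip(chunks, shape):
        if isinstance(size, int):
            # Repeat the block size along this dimension; a final, smaller
            # block holds whatever remainder is left.
            full, rest = divmod(dim, size)
            blocks = (size,) * full + ((rest,) if rest else ())
        else:
            # Already explicit: the sizes must add up to the dimension length.
            blocks = tuple(size)
            if sum(blocks) != dim:
                raise ValueError("block sizes must sum to the dimension length")
        explicit.append(blocks)
    return tuple(explicit)


shape = (2500, 800)  # hypothetical array shape
print(normalize_chunks_sketch(1000, shape))
# ((1000, 1000, 500), (800,))
print(normalize_chunks_sketch((1000, 400), shape))
# ((1000, 1000, 500), (400, 400))
print(normalize_chunks_sketch(((1000, 1000, 500), (400, 400)), shape))
# ((1000, 1000, 500), (400, 400))
```

The blocksize and blockshape forms can only describe blocks that repeat along a dimension (plus a remainder), while the explicit form can describe any split whose sizes sum to the dimension length.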

Examples

We'll discuss this through a sequence of examples of chunks on the following array:

1 2 3 4 5 6
7 8 9 0 1 2
3 4 5 6 7 8
9 0 1 2 3 4 
5 6 7 8 9 0 
1 2 3 4 5 6

We show how different chunks arguments split the array into different blocks.

chunks=3

Symmetric blocks of size 3

1 2 3  4 5 6
7 8 9  0 1 2
3 4 5  6 7 8

9 0 1  2 3 4 
5 6 7  8 9 0 
1 2 3  4 5 6

chunks=2

Symmetric blocks of size 2

1 2  3 4  5 6
7 8  9 0  1 2

3 4  5 6  7 8
9 0  1 2  3 4 

5 6  7 8  9 0 
1 2  3 4  5 6

chunks=(3, 2)

Asymmetric but repeated blocks of size (3, 2)

1 2  3 4  5 6
7 8  9 0  1 2
3 4  5 6  7 8

9 0  1 2  3 4 
5 6  7 8  9 0 
1 2  3 4  5 6

chunks=(1, 6)

Asymmetric but repeated blocks of size (1, 6)

1 2 3 4 5 6

7 8 9 0 1 2

3 4 5 6 7 8

9 0 1 2 3 4 

5 6 7 8 9 0 

1 2 3 4 5 6

chunks=((2, 4), (3, 3))

Asymmetric and non-repeated blocks

1 2 3  4 5 6
7 8 9  0 1 2

3 4 5  6 7 8
9 0 1  2 3 4 
5 6 7  8 9 0 
1 2 3  4 5 6

chunks=((2, 2, 1, 1), (3, 2, 1))

Asymmetric and non-repeated blocks

1 2 3  4 5  6
7 8 9  0 1  2

3 4 5  6 7  8
9 0 1  2 3  4 

5 6 7  8 9  0 

1 2 3  4 5  6
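The splits drawn above can be checked against dask itself by building a 6x6 array with each chunks argument and reading back its normalized .chunks attribute. This is a minimal sketch that assumes dask is installed; an array of ones stands in for the digits in the diagrams, since only the block layout matters here.

```python
import dask.array as da

specs = [3, 2, (3, 2), (1, 6), ((2, 4), (3, 3)), ((2, 2, 1, 1), (3, 2, 1))]

normalized = {}
for spec in specs:
    x = da.ones((6, 6), chunks=spec)
    # .chunks always holds the explicit form: one tuple of sizes per dimension.
    normalized[spec] = x.chunks
    print(f"chunks={spec!r} -> {x.chunks}")

# Order matters: the first entry applies to rows, the second to columns.
swapped = da.ones((6, 6), chunks=(2, 3))
print(swapped.chunks)     # ((2, 2, 2), (3, 3))
print(swapped.numblocks)  # (3, 2)
```

Each position in a blockshape lines up with the same position in the array's shape, so chunks=(3, 2) and chunks=(2, 3) describe different layouts of the same array.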

Discussion

Users rarely specify the latter, fully explicit forms on original data; they usually arise from complex slicing and broadcasting operations. I generally use the simplest form until I need a more complex one. The choice of chunks should align with the computations you want to do.

For example, if you plan to take out thin slices along the first dimension then you might want to make that dimension skinnier than the others. If you plan to do linear algebra then you might want more symmetric blocks.
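To make the thin-slice case concrete, the sketch below counts how many cells would be loaded to read one full row of the 6x6 example under two layouts. It rests on a simplifying assumption that is not stated in the original: a block is treated as the smallest unit that can be loaded, so touching any part of a block costs the whole block. The function name cells_loaded_for_row is hypothetical.

```python
def cells_loaded_for_row(explicit_chunks, row):
    """Cells loaded to read one full row, assuming whole blocks are loaded."""
    row_blocks, col_blocks = explicit_chunks
    start = 0
    for height in row_blocks:
        if start <= row < start + height:
            # Every block in this block-row overlaps the requested row,
            # so the cost is the block height times the full row width.
            return height * sum(col_blocks)
        start += height
    raise IndexError("row is outside the array")


square = ((3, 3), (3, 3))  # explicit form of chunks=3
thin = ((1,) * 6, (6,))    # explicit form of chunks=(1, 6)

print(cells_loaded_for_row(square, 2))  # 18
print(cells_loaded_for_row(thin, 2))    # 6
```

Under that assumption the skinny (1, 6) layout loads only the 6 cells that are needed, while the square layout loads three times as many.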

MRocklin answered Nov 27 '22 02:11