Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Representing a ragged array in numpy by padding

Tags:

python

numpy

I have a 1-dimensional numpy array scores of scores associated with some objects. These objects belong to some disjoint groups, and all the scores of the items in the first group are first, followed by the scores of the items in the second group, etc.

I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:

scores.reshape((numGroups, groupSize))

Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.

To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with four items.

scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]), 
                       f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)

The desired value of scoresByGroup would be:

 [[f(a[0]), f(a[1]), f(a[2]), -1.0], 
    [f(b[0]), f(b[1]), -1.0, -1.0]
    [f(c[0]), f(c[1]), f(c[2]), f(c[3])]]

Is there some numpy function or composition of functions I can use to create groupIntoRows?

Background:

  • This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
  • It's fine to assume there is some known maximum row size
  • The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
like image 381
Ryan Gabbard Avatar asked May 02 '13 19:05

Ryan Gabbard


People also ask

How do you padding an array in Python?

pad() function is used to pad the Numpy arrays. Sometimes there is a need to perform padding in Numpy arrays, then numPy. pad() function is used. The function returns the padded array of rank equal to the given array and the shape will increase according to pad_width.

Does NumPy support ragged arrays?

I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length. To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with four items.

What does padding an array do?

The array padding transformation sets a dimension in an array to a new size. The goal of this transformation is to reduce the number of memory system conflicts. The transformation is applied to a full function AST. The new size can be specified by the user or can be computed automatically.

What is a ragged array NumPy?

In computer science, a jagged array, also known as a ragged array, irregular array is an array of arrays of which the member arrays can be of different lengths, producing rows of jagged edges when visualized as output.


1 Answers

Try this:

scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)
pad_len = np.max(lens) - lens
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
                          padding_value).reshape(-1, np.max(lens))

>>> padded_scores
array([[ 0.05878244,  0.40804443,  0.35640463, -1.        ],
       [ 0.39365072,  0.85313545, -1.        , -1.        ],
       [ 0.133687  ,  0.73651147,  0.98531828,  0.78940163]])
like image 182
Jaime Avatar answered Sep 27 '22 23:09

Jaime