
Randomly split a numpy array

Tags: python, numpy

I have a numpy array of shape 46928x28x28 and I want to randomly split it into two sub-arrays of shapes 41928x28x28 and 5000x28x28, that is, to randomly pick rows from the initial array. The code I tried so far (to calculate the indices for the two sub-arrays) is the following:

ind = np.random.randint(input_matrix.shape[0], size=(5000,))
rest = np.array([i for i in range(0,input_matrix.shape[0]) if i not in ind])
rest = np.array(rest)

However, surprisingly, the shape of ind is (5000,) while the shape of rest is (42192,). What am I doing wrong?

konstantin asked May 23 '18 14:05




2 Answers

The error is that randint samples with replacement, so it returns some repeated indices. You can check this by printing len(set(ind)): you will see it is smaller than 5000.
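
To see the effect in isolation, here is a minimal sketch that reproduces the mismatch without the real data. The name n_rows is a stand-in introduced here for input_matrix.shape[0], and np.setdiff1d is used in place of the list comprehension only to keep the demo fast.

```python
import numpy as np

n_rows = 46928  # stand-in for input_matrix.shape[0]

# randint draws with replacement, so some indices appear more than once
ind = np.random.randint(n_rows, size=(5000,))
n_unique = np.unique(ind).size

# Everything not drawn; its size depends on the number of *unique* draws
rest = np.setdiff1d(np.arange(n_rows), ind)

print(ind.shape, n_unique, rest.shape)
```

The size of rest is always n_rows minus the number of unique indices, which is why it comes out larger than 41928.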

To use the same idea, simply replace the first line with

ind = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)

That being said, the second line of your code is pretty slow, because every `i not in ind` test scans the whole ind array inside a Python-level loop. It would be much faster to define the indices you want with a vector of booleans, which would allow you to use the negation operator ~.

choice = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
ind = np.zeros(input_matrix.shape[0], dtype=bool)  # boolean mask, all False
ind[choice] = True  # mark the 5000 chosen rows
rest = ~ind  # mask of the remaining rows
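
For completeness, a minimal sketch of how the two masks can then be used to index the array. It assumes a small random stand-in for input_matrix and uses the newer Generator API (np.random.default_rng) rather than np.random.choice; the names n_pick, picked and remaining are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is an arbitrary choice
input_matrix = rng.random((100, 28, 28))  # small stand-in for the real data
n_pick = 10  # plays the role of 5000 in the question

# Draw n_pick distinct row indices
choice = rng.choice(input_matrix.shape[0], size=n_pick, replace=False)

# Boolean mask: True for the picked rows, False elsewhere
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True

picked = input_matrix[ind]      # rows where the mask is True
remaining = input_matrix[~ind]  # all other rows

print(picked.shape, remaining.shape)
```

Note that boolean indexing returns the rows in their original order within each part, and both results are copies rather than views.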

On my machine, this method is exactly as fast as using scikit-learn's train_test_split, which makes me think that the two are doing essentially the same thing.

Gianluca Micchi answered Sep 20 '22 14:09


Just a quick update to say that this is readily solved using shuffle:

rng = np.random.default_rng()
rng.shuffle(data, axis=0)  # shuffles data in place along the first axis
split1 = data[:41928]
split2 = data[41928:]
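
Since shuffle reorders data in place, the original ordering is lost afterwards. If that matters, one possible variant (a sketch, not part of the answer above) is rng.permutation, which returns a shuffled copy and leaves the input untouched. The small array and the name n_first are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is an arbitrary choice
data = np.arange(20 * 2 * 2).reshape(20, 2, 2)  # small stand-in array
original = data.copy()  # kept only to show that data is not modified

# permutation returns a shuffled copy along axis 0
shuffled = rng.permutation(data, axis=0)

n_first = 15  # plays the role of 41928
split1 = shuffled[:n_first]
split2 = shuffled[n_first:]

print(split1.shape, split2.shape)
```

The trade-off is memory: the shuffled copy has the same size as the input.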

If you're using this for an ML application, this has the added benefit of randomizing the order of your train and test sets, which is often desirable. If you need to preserve the given ordering on the two split arrays, you can shuffle indices instead and re-sort:

idx = np.arange(data.shape[0])
rng.shuffle(idx)
idx1 = np.sort(idx[:41928])
idx2 = np.sort(idx[41928:])
split1 = data[idx1, ...]
split2 = data[idx2, ...]
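
The order-preserving version can be wrapped in a small helper and checked on toy data. This is only a sketch: the function name ordered_random_split and the toy array are made up for illustration, and rng.permutation(n) is used as a shorthand for shuffling np.arange(n).

```python
import numpy as np

def ordered_random_split(data, n_first, rng):
    """Randomly split rows into two parts, keeping the original row order."""
    idx = rng.permutation(data.shape[0])  # shuffled row indices
    idx1 = np.sort(idx[:n_first])         # re-sort to restore original order
    idx2 = np.sort(idx[n_first:])
    return data[idx1, ...], data[idx2, ...], idx1, idx2

rng = np.random.default_rng(0)  # the seed is an arbitrary choice
# Toy data: row i is a 3x3 block filled with the value i
data = np.arange(50)[:, None, None] * np.ones((1, 3, 3))

split1, split2, idx1, idx2 = ordered_random_split(data, 40, rng)
print(split1.shape, split2.shape)
```

Because row i holds the value i, it is easy to confirm that each split is still in increasing order and that the two index sets do not overlap.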

Grant answered Sep 20 '22 14:09