I have a numpy array of size 46928x28x28 and I want to randomly split that array into two sub-matrices with sizes (41928x28x28) and (5000x28x28). In other words, I want to randomly pick rows from the initial array. The code I tried so far (to calculate the indexes for the two sub-arrays) is the following:
ind = np.random.randint(input_matrix.shape[0], size=(5000,))
rest = np.array([i for i in range(0,input_matrix.shape[0]) if i not in ind])
rest = np.array(rest)
However, surprisingly, the shape of ind is (5000,) while the shape of rest is (42192,). What am I doing wrong in that case?
You can use the numpy.split() function to split a 2-D array into several sub-arrays. The split can be row-wise or column-wise; by default it is row-wise (axis=0). Alternatively, use the array_split() method: pass in the array you want to split and the number of splits you want. Note that both functions cut the array into contiguous blocks in its existing order, so on their own they do not pick rows at random.
The error is that randint samples with replacement, so it is giving some repeated indices. You can test it by printing len(set(ind)) and you will see it is smaller than 5000.
To use the same idea, simply replace the first line with
ind = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
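To make the difference concrete, here is a minimal sketch. It assumes only the row count from the question (46928) rather than the real array, and the seed is an arbitrary choice for reproducibility. The use of np.setdiff1d to compute the remaining indices is a swapped-in alternative to the list comprehension, not part of the original code.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the run is repeatable

n = 46928  # number of rows in the question's array

# randint samples WITH replacement, so duplicate indices are expected
ind_dup = np.random.randint(n, size=(5000,))
n_unique_dup = len(set(ind_dup.tolist()))

# choice with replace=False samples WITHOUT replacement
ind = np.random.choice(n, size=(5000,), replace=False)
n_unique = len(set(ind.tolist()))

# setdiff1d returns the sorted indices that were not chosen
rest = np.setdiff1d(np.arange(n), ind)

print(n_unique_dup, n_unique, rest.shape)
```

With unique indices, ind and rest together cover every row exactly once, so the two shapes add up to 46928.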
That being said, the second line of your code is pretty slow, because the test i not in ind scans the whole index array for every i. It would be much faster to define the indices you want with a vector of booleans, which would allow you to use the negation operator ~.
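# Draw 5000 distinct row indices
choice = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
# Boolean mask: True for the chosen rows, False elsewhere
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
# Negating the mask selects the remaining rows
rest = ~ind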
On my machine, this method is about as fast as using scikit-learn's train_test_split, which makes me think that the two are doing much the same thing.
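The boolean mask can be used to index the array directly. The following is a minimal sketch on a small random stand-in array (100 rows instead of 46928, and 10 chosen rows instead of 5000, both assumptions made to keep it light). The names test_part and train_part are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Small stand-in for the (46928, 28, 28) array in the question
input_matrix = rng.random((100, 28, 28))
n_pick = 10  # would be 5000 for the question's data

# Distinct row indices, then a boolean mask marking them
choice = rng.choice(input_matrix.shape[0], size=n_pick, replace=False)
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True

test_part = input_matrix[ind]    # rows where the mask is True
train_part = input_matrix[~ind]  # all remaining rows

print(test_part.shape, train_part.shape)
```

Boolean indexing returns the selected rows in their original order, so neither part is shuffled internally.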
Just a quick update to say that this is readily solved using shuffle, which reorders the array in place:
rng = np.random.default_rng()
rng.shuffle(data, axis=0)
split1 = data[:41928]
split2 = data[41928:]
If you're using this for an ML application, this has the added benefit of randomizing the order of your train and test sets, which is often desirable. If you need to preserve the given ordering on the two split arrays, you can shuffle indices instead and re-sort:
idx = np.arange(data.shape[0])
rng.shuffle(idx)
idx1 = np.sort(idx[:41928])
idx2 = np.sort(idx[41928:])
split1 = data[idx1, ...]
split2 = data[idx2, ...]
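Both variants above can be wrapped in one small function. This is only a sketch: random_split, n_first, seed and keep_order are hypothetical names introduced here, and it uses rng.permutation on the indices so that the input array itself is left untouched.

```python
import numpy as np

def random_split(data, n_first, seed=None, keep_order=False):
    """Randomly split `data` along axis 0 into n_first rows and the rest.

    Hypothetical helper for illustration; not part of NumPy.
    """
    rng = np.random.default_rng(seed)
    # A shuffled copy of 0..n-1; `data` is not modified
    idx = rng.permutation(data.shape[0])
    idx1, idx2 = idx[:n_first], idx[n_first:]
    if keep_order:
        # Restore the original row ordering inside each part
        idx1, idx2 = np.sort(idx1), np.sort(idx2)
    return data[idx1], data[idx2]

# Tiny stand-in array: 20 "images" of size 2x2
data = np.arange(20 * 2 * 2).reshape(20, 2, 2)
a, b = random_split(data, 15, seed=0, keep_order=True)
print(a.shape, b.shape)
```

For the array in the question, the call would be random_split(data, 41928). Passing a fixed seed makes the split repeatable between runs.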