I am trying to split my numpy array of data points into test and training sets. To do that, I'm randomly selecting rows from the array to use as the training set, and the remaining rows form the test set.
This is my code:
import numpy
matrix = numpy.loadtxt("matrix_vals.data", delimiter=',', dtype=float)
matrix_rows, matrix_cols = matrix.shape
# training set
randvals = numpy.random.randint(matrix_rows, size=50)
train = matrix[randvals,:]
test = numpy.delete(matrix, randvals, 0)
print("matrix.shape:", matrix.shape)
print("train.shape:", train.shape)
print("test.shape:", test.shape)
But the output I get is:
matrix.shape: (130, 14)
train.shape: (50, 14)
test.shape: (89, 14)
This is obviously wrong: the row counts of train and test should add up to the total number of rows in the matrix (130), but here they sum to 139. Can anyone help me figure out what's going wrong?
Because numpy.random.randint draws random integers with replacement, randvals will almost certainly contain repeated indices.
Indexing with repeated indices will return the same row multiple times, so matrix[randvals, :] is guaranteed to give you an output with exactly 50 rows, regardless of whether some of them are repeated.
In contrast, np.delete(matrix, randvals, 0) removes each listed row only once, no matter how many times its index appears, so it reduces the number of rows only by the number of unique values in randvals.
Try comparing:
print(np.unique(randvals).shape[0] == matrix_rows - test.shape[0])
# True
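The mismatch is easier to see on a small array. The following is only an illustrative sketch: the 6x2 array and the index vector idx are made up for the demonstration and are not taken from the question's data.

```python
import numpy as np

# Made-up 6x2 array standing in for the real data.
matrix = np.arange(12).reshape(6, 2)

# Hand-written indices with a deliberate repeat (row 2 appears twice).
idx = np.array([0, 2, 2, 5])

train = matrix[idx]                    # indexing keeps the duplicate row
test = np.delete(matrix, idx, axis=0)  # each listed row is deleted only once

n_train = train.shape[0]         # one row per index, duplicates included
n_test = test.shape[0]           # total rows minus the number of unique indices
n_unique = np.unique(idx).size   # number of distinct rows actually removed

print(n_train, n_test, n_unique)
```

Here n_train + n_test exceeds the number of rows in matrix by exactly the number of repeated indices, which is the same effect seen in the question.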
To generate a vector of unique random indices between 0 and matrix_rows - 1, you could use np.random.choice with replace=False:
uidx = np.random.choice(matrix_rows, size=50, replace=False)
Then matrix[uidx].shape[0] + np.delete(matrix, uidx, 0).shape[0] == matrix_rows.
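Another way to get a disjoint split, sketched here as a swapped-in alternative to np.random.choice, is to shuffle all row indices once and slice the result. The helper name split_rows, the seed argument, and the stand-in array are assumptions made for illustration; np.random.default_rng requires NumPy 1.17 or later.

```python
import numpy as np

def split_rows(matrix, n_train, seed=None):
    """Hypothetical helper: split rows into disjoint train and test sets."""
    rng = np.random.default_rng(seed)
    # A permutation contains every row index exactly once, so the two
    # slices below cannot overlap and together cover all rows.
    perm = rng.permutation(matrix.shape[0])
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return matrix[train_idx], matrix[test_idx]

# Stand-in data with the same shape as in the question (130 rows, 14 columns).
matrix = np.arange(130 * 14, dtype=float).reshape(130, 14)
train, test = split_rows(matrix, 50, seed=0)

print(train.shape, test.shape)
```

Because both sets come from one permutation, their row counts always sum to matrix.shape[0], and no call to np.delete is needed.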