
NumPy: How to randomly split/select a matrix into n different matrices

  • I have a NumPy matrix of shape (4601, 58).
  • I want to split the matrix randomly by rows into 60%, 20%, and 20% subsets.
  • This is for a machine learning task.
  • Is there a NumPy function that randomly selects rows?
daydreamer asked Feb 01 '12 00:02


4 Answers

You can use numpy.random.shuffle, which shuffles the rows of an array in place, and then slice the shuffled array:

import numpy as np

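N = 4601
data = np.arange(N*58).reshape(-1, 58)  # stand-in for the (4601, 58) matrix
np.random.shuffle(data)                 # shuffles the rows in place

a = data[:int(N*0.6)]              # first 60% of the shuffled rows
b = data[int(N*0.6):int(N*0.8)]    # next 20%
c = data[int(N*0.8):]              # remaining rows (about 20%)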
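
If the original row order needs to survive the split, a variant along the following lines might help. It is only a sketch: it assumes NumPy 1.17 or later (for np.random.default_rng), and the helper name split_rows, its fractions parameter, and the seed value are made up for illustration rather than taken from the answer.

```python
import numpy as np

def split_rows(data, fractions=(0.6, 0.2), seed=0):
    """Return random row subsets; whatever is left over forms the last subset."""
    rng = np.random.default_rng(seed)   # hypothetical fixed seed for repeatability
    shuffled = rng.permutation(data)    # shuffled copy along axis 0; data is untouched
    n_rows = data.shape[0]
    # Cumulative fractions become integer row boundaries (60% and 80% of n_rows here)
    cut_points = (np.cumsum(fractions) * n_rows).astype(int)
    return np.split(shuffled, cut_points)

data = np.arange(4601 * 58).reshape(-1, 58)
a, b, c = split_rows(data)
print(a.shape, b.shape, c.shape)  # (2760, 58) (920, 58) (921, 58)
```

Because permutation returns a copy, data keeps its original order, at the cost of holding a second array in memory.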
HYRY answered Oct 14 '22 03:10


A complement to HYRY's answer: if you want to shuffle several arrays x, y, z consistently, where all of them share the same first dimension (x.shape[0] == y.shape[0] == z.shape[0] == n_samples), shuffle an array of indices once and apply it to each array:

import numpy as np

# x, y, z and n_samples are assumed to be defined already
rng = np.random.RandomState(42)  # reproducible results with a fixed seed
indices = np.arange(n_samples)
rng.shuffle(indices)
x_shuffled = x[indices]
y_shuffled = y[indices]
z_shuffled = z[indices]

And then proceed with the split of each shuffled array as in HYRY's answer.
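
Putting the two steps together might look roughly like this. The function name shuffle_and_split, its parameters, and the toy arrays are assumptions made for the example; only the index-shuffling idea comes from the answer.

```python
import numpy as np

def shuffle_and_split(arrays, fractions=(0.6, 0.2), seed=42):
    """Split several arrays with one shared random row order."""
    n_samples = arrays[0].shape[0]
    if any(arr.shape[0] != n_samples for arr in arrays):
        raise ValueError("all arrays must share the same first dimension")
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)             # one shuffled index array
    cut_points = (np.cumsum(fractions) * n_samples).astype(int)
    index_groups = np.split(indices, cut_points)     # one index group per subset
    # Apply the same index group to every array so rows stay paired
    return [tuple(arr[group] for arr in arrays) for group in index_groups]

# Toy data: row i of x is [2i, 2i+1] and y[i] is 10*i
x = np.arange(200).reshape(100, 2)
y = np.arange(100) * 10
(x_train, y_train), (x_val, y_val), (x_test, y_test) = shuffle_and_split([x, y])
```

Each returned pair keeps the rows of x and y aligned, since both were indexed with the same group of indices.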

ogrisel answered Oct 14 '22 04:10


If you want to randomly select rows, you could just use random.sample from the Python standard library:

import random

population = range(4601) # Your number of rows
choice = random.sample(population, k) # k being the number of samples you require

random.sample samples without replacement, so you don't need to worry about repeated rows ending up in choice. Given a NumPy array called matrix, you can select the rows by indexing it with the list of row indices, like this: matrix[choice].

Of course, k can be equal to the total number of elements in the population, in which case choice contains a random ordering of all the row indices. You can then partition choice as you please, if that's all you need.
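
As a rough illustration of that last idea, the following sketch draws a full random ordering with random.sample and cuts it into three parts. The small stand-in matrix and the names n_first, n_second, first, second, and third are assumptions for the example.

```python
import random
import numpy as np

matrix = np.arange(40).reshape(20, 2)   # small stand-in for the real data
n_rows = matrix.shape[0]

# With k equal to the population size, the sample is a random ordering of every row index
choice = random.sample(range(n_rows), n_rows)

n_first = int(n_rows * 0.6)    # size of the 60% part
n_second = int(n_rows * 0.2)   # size of the first 20% part

first = matrix[choice[:n_first]]
second = matrix[choice[n_first:n_first + n_second]]
third = matrix[choice[n_first + n_second:]]   # whatever rows remain
```

Since the indices come from sampling without replacement, every row lands in exactly one of the three parts.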

Ricardo Cárdenes answered Oct 14 '22 04:10


Since you need it for machine learning, here is a function I wrote:

import numpy as np

def split_random(matrix, percent_train=70, percent_test=15):
    """
    Splits matrix data into randomly ordered sets 
    grouped by provided percentages.

    Usage:
    rows = 100
    columns = 2
    matrix = np.random.rand(rows, columns)
    training, testing, validation = \
    split_random(matrix, percent_train=80, percent_test=10)

    percent_validation 10
    training (80, 2)
    testing (10, 2)
    validation (10, 2)

    Returns:
    - training_data: percent_train, e.g. 70%
    - testing_data: percent_test, e.g. 15%
    - validation_data: remainder from 100%, e.g. 15%
    Created by Uki D. Lucas on Feb. 4, 2017
    """

    percent_validation = 100 - percent_train - percent_test

    if percent_validation < 0:
        print("Make sure that the sum of the training and testing "
              "percentages is less than or equal to 100.")
        percent_validation = 0
    else:
        print("percent_validation", percent_validation)

    rows = matrix.shape[0]
    np.random.shuffle(matrix)  # shuffles rows in place; the caller's array is modified

    end_training = int(rows*percent_train/100)    
    end_testing = end_training + int((rows * percent_test/100))

    training = matrix[:end_training]
    testing = matrix[end_training:end_testing]
    validation = matrix[end_testing:]
    return training, testing, validation

# TEST:
rows = 100
columns = 2
matrix = np.random.rand(rows, columns)
training, testing, validation = split_random(matrix, percent_train=80, percent_test=10) 

print("training",training.shape)
print("testing",testing.shape)
print("validation",validation.shape)

print(split_random.__doc__)

Output (leaving out the docstring printed by the last line):

  • percent_validation 10
  • training (80, 2)
  • testing (10, 2)
  • validation (10, 2)
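
One thing to keep in mind with the function above: np.random.shuffle reorders the array it is given, so the caller's matrix changes as a side effect, and the returned slices are views of that same array. If that matters, shuffling a copy is one way around it. The variable names in this small sketch are made up for illustration.

```python
import numpy as np

original = np.arange(10).reshape(5, 2)
snapshot = original.copy()     # kept only to compare against later

working = original.copy()      # shuffle a copy rather than the caller's array
np.random.shuffle(working)     # the in-place shuffle affects `working` only

head = working[:3]             # basic slicing returns a view, not a copy

print(np.array_equal(original, snapshot))  # True: the original order is intact
print(np.shares_memory(head, working))     # True: the slice shares memory with `working`
```

The same effect could be had by calling split_random(matrix.copy(), ...) instead of passing matrix directly.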
Uki D. Lucas answered Oct 14 '22 04:10