Randomly split a numpy array

Tags: python, numpy

I have a numpy array of shape 46928x28x28 and I want to randomly split it into two sub-arrays of shapes 41928x28x28 and 5000x28x28, that is, to randomly pick rows from the initial array. The code I have tried so far (to compute the indices for the two sub-arrays) is the following:

ind = np.random.randint(input_matrix.shape[0], size=(5000,))
rest = np.array([i for i in range(0,input_matrix.shape[0]) if i not in ind])
rest = np.array(rest)

However, surprisingly, the shape of ind is (5000,) while the shape of rest is (42192,). What am I doing wrong?

asked May 23 '18 14:05

konstantin

2 Answers

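The problem is that randint samples with replacement, so ind contains repeated indices. You can check this by printing len(set(ind)): it will be smaller than 5000. That is why rest has 42192 elements rather than 41928: only 46928 - 42192 = 4736 of the 5000 sampled indices are distinct.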

To use the same idea, simply replace the first line with

ind = np.random.choice(input_matrix.shape[0], size=5000, replace=False)
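The difference between the two sampling modes can be checked directly. The following is a minimal sketch, not the asker's exact code: it assumes the newer Generator API (rng.integers and rng.choice standing in for np.random.randint and np.random.choice), an arbitrary seed, and only the row count from the question rather than the real array.

```python
import numpy as np

n_rows = 46928   # number of rows in the question's array
n_pick = 5000    # size of the smaller split

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for repeatability

# Sampling with replacement (what randint does): duplicates are expected.
with_repl = rng.integers(n_rows, size=n_pick)
# Sampling without replacement: every index is distinct.
without_repl = rng.choice(n_rows, size=n_pick, replace=False)

n_unique_with = np.unique(with_repl).size
n_unique_without = np.unique(without_repl).size

# The complement of the sampled indices is what "rest" ends up containing.
rest_with = np.setdiff1d(np.arange(n_rows), with_repl)
rest_without = np.setdiff1d(np.arange(n_rows), without_repl)

print(n_unique_with, rest_with.size)        # fewer than 5000 unique, more than 41928 left
print(n_unique_without, rest_without.size)  # exactly 5000 unique, exactly 41928 left
```

With replacement, the size of the complement always equals n_rows minus the number of unique sampled indices, which is the mismatch seen in the question.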

That being said, the second line of your code is pretty slow, because it iterates in Python over every index and each "i not in ind" test scans ind. It is much faster to mark the rows you want with a boolean mask, which lets you get the complement with the negation operator ~.

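# Pick 5000 distinct row indices
choice = np.random.choice(input_matrix.shape[0], size=5000, replace=False)
# Boolean mask: True for the sampled rows
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
# The negated mask selects the remaining rows
rest = ~ind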
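To show how the two masks are then used, here is a small sketch that applies them to an array. The array is a stand-in with the same trailing shape as in the question but fewer rows (1000 instead of 46928, with 100 rows picked instead of 5000), and the seed is arbitrary; both are assumptions made to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability

# Stand-in for the question's array: same trailing shape, fewer rows.
input_matrix = rng.random((1000, 28, 28))
n_pick = 100

# Distinct row indices for the smaller split.
choice = rng.choice(input_matrix.shape[0], size=n_pick, replace=False)

# Boolean mask over the rows, and its complement.
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
rest = ~ind

# Boolean indexing along axis 0 selects whole 28x28 slices.
small = input_matrix[ind]
large = input_matrix[rest]

print(small.shape, large.shape)  # (100, 28, 28) (900, 28, 28)
```

Note that indexing with a boolean mask returns the selected rows in their original order, so neither split is shuffled by this approach.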

On my machine, this method runs about as fast as scikit-learn's train_test_split, which makes me think that the two are doing essentially the same thing.

answered Sep 20 '22 14:09

Gianluca Micchi

Just a quick update to say that this is readily solved using shuffle:

import numpy as np

rng = np.random.default_rng()
rng.shuffle(data, axis=0)  # shuffles the rows of data in place
split1 = data[:41928]
split2 = data[41928:]

If you're using this for an ML application, this has the added benefit of randomizing the order of your train and test sets, which is often desirable. If you need to preserve the given ordering on the two split arrays, you can shuffle indices instead and re-sort:

idx = np.arange(data.shape[0])
rng.shuffle(idx)
idx1 = np.sort(idx[:41928])
idx2 = np.sort(idx[41928:])
split1 = data[idx1, ...]
split2 = data[idx2, ...]
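The index-based variant can be wrapped in a small function. This is only a sketch: random_split, its parameters (n_first, seed, keep_order) and the toy array are hypothetical names introduced here for illustration, and rng.permutation is used in place of shuffling an arange by hand.

```python
import numpy as np

def random_split(data, n_first, seed=None, keep_order=False):
    """Split `data` along axis 0 into two random, disjoint parts.

    Hypothetical helper for illustration, not part of NumPy.
    Returns both parts and the row indices that produced them.
    """
    if not 0 <= n_first <= data.shape[0]:
        raise ValueError("n_first must be between 0 and data.shape[0]")
    rng = np.random.default_rng(seed)
    idx = rng.permutation(data.shape[0])  # random ordering of 0..n-1
    idx1, idx2 = idx[:n_first], idx[n_first:]
    if keep_order:
        # Re-sort so each part keeps the original row order.
        idx1, idx2 = np.sort(idx1), np.sort(idx2)
    return data[idx1], data[idx2], idx1, idx2

# Toy data: 10 "rows" of shape 2x2 instead of 46928 rows of 28x28.
data = np.arange(10 * 2 * 2).reshape(10, 2, 2)
split1, split2, idx1, idx2 = random_split(data, 7, seed=0, keep_order=True)

print(split1.shape, split2.shape)  # (7, 2, 2) (3, 2, 2)
```

Because the rows are selected by integer indexing, the original array is left untouched, unlike the in-place shuffle. For the array in the question the call would be random_split(data, 41928).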

answered Sep 20 '22 14:09

Grant
