
Shuffle DataFrame rows

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

JNevens asked Apr 11 '15 09:04


People also ask

How do you shuffle rows in a DataFrame?

One of the easiest ways to shuffle a pandas DataFrame is to use its sample method. df.sample draws rows from the DataFrame in random order, so asking it to return the entire DataFrame (frac=1) gives you every row, shuffled.

How do you randomly shuffle rows in Python?

To shuffle both rows and columns randomly, you can use df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True).
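As a rough sketch of that chained call, the snippet below shuffles the columns first and the rows second. The toy data and the random_state values are assumptions added here for illustration and repeatability; they are not part of the original snippet.

```python
import pandas as pd

# Hypothetical toy data, not taken from the original question.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# axis=1 shuffles the column order; the second sample shuffles the rows.
# The random_state values are arbitrary and only make the run repeatable.
mixed = (
    df.sample(frac=1, axis=1, random_state=1)
      .sample(frac=1, random_state=2)
      .reset_index(drop=True)
)

print(mixed)
```

Each row keeps its own values together; only the order of rows and columns changes.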

How do I shuffle the rows of a dataset in R?

In R, you can shuffle the rows of a data frame with the sample() function. Passing nrow(df) to sample() yields a random permutation of the row indices, and indexing the data frame with that permutation, as in df[sample(nrow(df)), ], returns all rows in shuffled order.


1 Answer

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1) 

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
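A minimal sketch of this on a small frame shaped like the one in the question. The values are made up, and random_state is an optional argument added here only so the result is repeatable.

```python
import pandas as pd

# Toy frame mimicking the question's layout (values are illustrative only).
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 returns every row exactly once, in random order.
# random_state is optional; it is set here only to make the shuffle repeatable.
shuffled = df.sample(frac=1, random_state=0)

print(shuffled)
```

The shuffled frame keeps the original index labels, so sorting by index (shuffled.sort_index()) recovers the original order.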


Note: If you wish to replace your dataframe with the shuffled version and reset the index, you could reassign the result, e.g.

df = df.sample(frac=1).reset_index(drop=True) 

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
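To make the effect of drop=True concrete, here is a small sketch comparing both variants. The frame and the random_state value are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

shuffled = df.sample(frac=1, random_state=0)

# Without drop=True, the old labels are kept as a new "index" column.
with_old_index = shuffled.reset_index()

# With drop=True, the old labels are discarded.
clean = shuffled.reset_index(drop=True)

print(list(with_old_index.columns))  # ['index', 'Col1', 'Type']
print(list(clean.columns))           # ['Col1', 'Type']
```

In both cases the new index runs from 0 to len(df) - 1; the only difference is whether the previous labels survive as a column.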

Follow-up note: The operation above is not in-place in the strict sense: df.sample returns a new object, so id(df_old) is not the same as id(df_new). Even so, rebinding df lets Python release the original frame, so the process does not end up holding two full copies of the data. A simple line-by-line memory profile illustrates this (it reports memory after each line finishes, so it does not rule out a temporary allocation while the line runs):

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

Kris answered Sep 23 '22 18:09