I have the following DataFrame:
Col1 Col2 Col3 Type 0 1 2 3 1 1 4 5 6 1 ... 20 7 8 9 2 21 10 11 12 2 ... 45 13 14 15 3 46 16 17 18 3 ...
The DataFrame is read from a csv file. All rows which have Type
1 are on top, followed by the rows with Type
2, followed by the rows with Type
3, etc.
I would like to shuffle the order of the DataFrame's rows, so that all Type
's are mixed. A possible result could be:
Col1 Col2 Col3 Type 0 7 8 9 2 1 13 14 15 3 ... 20 1 2 3 1 21 10 11 12 2 ... 45 4 5 6 1 46 16 17 18 3 ...
How can I achieve this?
One of the easiest ways to shuffle a Pandas Dataframe is to use the Pandas sample method. The df. sample method allows you to sample a number of rows in a Pandas Dataframe in a random order. Because of this, we can simply specify that we want to return the entire Pandas Dataframe, in a random order.
Shuffle DataFrame Randomly by Rows and Columns You can use df. sample(frac=1, axis=1). sample(frac=1). reset_index(drop=True) to shuffle rows and columns randomly.
We can shuffle the rows in the dataframe by using sample() function. By providing indexing to the dataframe the required task can be easily achieved. Where. sample() function is used to shuffle the rows that takes a parameter with a function called nrow() with a slice operator to get all rows shuffled.
The idiomatic way to do this with Pandas is to use the .sample
method of your dataframe to sample all rows without replacement:
df.sample(frac=1)
The frac
keyword argument specifies the fraction of rows to return in the random sample, so frac=1
means return all rows (in random order).
Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True
prevents .reset_index
from creating a column containing the old index entries.
Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old)
is not the same as id(df_new)
), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:
$ python3 -m memory_profiler .\test.py Filename: .\test.py Line # Mem usage Increment Line Contents ================================================ 5 68.5 MiB 68.5 MiB @profile 6 def shuffle(): 7 847.8 MiB 779.3 MiB df = pd.DataFrame(np.random.randn(100, 1000000)) 8 847.9 MiB 0.1 MiB df = df.sample(frac=1).reset_index(drop=True)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With