Shuffle DataFrame rows

Tags: python, pandas, dataframe, shuffle, permutation

I have the following DataFrame:

        Col1  Col2  Col3  Type
    0      1     2     3     1
    1      4     5     6     1
    ...
    20     7     8     9     2
    21    10    11    12     2
    ...
    45    13    14    15     3
    46    16    17    18     3
    ...

The DataFrame is read from a CSV file. All rows with Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

        Col1  Col2  Col3  Type
    0      7     8     9     2
    1     13    14    15     3
    ...
    20     1     2     3     1
    21    10    11    12     2
    ...
    45     4     5     6     1
    46    16    17    18     3
    ...

How can I achieve this?

382

asked Apr 11 '15 09:04

JNevens

1 Answer

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

    df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
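As a minimal sketch of what frac=1 does, the following builds a small stand-in DataFrame (the values are copied from the question; random_state=0 is an assumption added here only to make the run repeatable) and checks that shuffling keeps every row while the original index labels travel with the rows.

```python
import pandas as pd

# Small stand-in for the DataFrame in the question (rows grouped by Type).
df = pd.DataFrame(
    {
        "Col1": [1, 4, 7, 10, 13, 16],
        "Col2": [2, 5, 8, 11, 14, 17],
        "Col3": [3, 6, 9, 12, 15, 18],
        "Type": [1, 1, 2, 2, 3, 3],
    }
)

# frac=1 samples 100% of the rows without replacement, i.e. a permutation.
# random_state is optional; it is fixed here only for reproducibility.
shuffled = df.sample(frac=1, random_state=0)

# Same rows, different order: the old index labels stay attached to their rows.
print(shuffled)
print(sorted(shuffled.index) == list(df.index))  # True
```

Dropping random_state gives a different order on each run, which is usually what you want for a real shuffle.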

Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

    df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
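To make the effect of drop=True concrete, here is a small sketch comparing both variants on a toy DataFrame (the data and the names clean and kept are made up for illustration).

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

# random_state is fixed only so the example is repeatable.
shuffled = df.sample(frac=1, random_state=0)

# drop=True discards the old labels and renumbers the rows 0..n-1.
clean = shuffled.reset_index(drop=True)

# Without drop=True the old labels are kept as a new column named "index".
kept = shuffled.reset_index()

print(list(clean.columns))  # ['Col1', 'Type']
print(list(kept.columns))   # ['index', 'Col1', 'Type']
```

Keeping the old labels can be handy if you later need to map shuffled rows back to their position in the CSV file.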

Follow-up note: The operation above is not literally in-place: df.sample returns a new DataFrame, so the object has changed (by which I mean id(df_old) is not the same as id(df_new)). Rebinding the name df does, however, leave the original object unreferenced, so Python can free it, and the memory in use after the statement is about the same as before it. A simple memory profiler run illustrates this. Note that, as far as I can tell, it reports memory once each line has finished, so it shows the net change rather than any temporary peak while both copies exist:

    $ python3 -m memory_profiler .\test.py
    Filename: .\test.py

    Line #    Mem usage    Increment   Line Contents
    ================================================
         5     68.5 MiB     68.5 MiB   @profile
         6                             def shuffle():
         7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
         8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
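The point about object identity can be checked directly. This is only a sketch on a toy DataFrame; df_old and df_new are the illustrative names used in the note above, not variables from the profiled script.

```python
import pandas as pd

df_old = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

# sample() and reset_index() each return a new DataFrame.
df_new = df_old.sample(frac=1).reset_index(drop=True)

print(df_new is df_old)         # False: two distinct Python objects
print(df_old["Col1"].tolist())  # [1, 4, 7, 10]: the original is left untouched
```

When the result is assigned back to the same name, as in df = df.sample(frac=1).reset_index(drop=True), nothing refers to the original DataFrame any more, so it can be garbage-collected.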

161

answered Sep 23 '22 18:09

Kris
