Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

shuffling/permutating a DataFrame in pandas

People also ask

Can you shuffle a pandas DataFrame?

One of the easiest ways to shuffle a Pandas Dataframe is to use the Pandas sample method. The df. sample method allows you to sample a number of rows in a Pandas Dataframe in a random order. Because of this, we can simply specify that we want to return the entire Pandas Dataframe, in a random order.

How do I randomly shuffle rows of DataFrame in Python?

Shuffle DataFrame Randomly by Rows and Columns You can use df. sample(frac=1, axis=1). sample(frac=1). reset_index(drop=True) to shuffle rows and columns randomly.


Use numpy's random.permuation function:

In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9


In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4

Sampling randomizes, so just sample the entire data frame.

df.sample(frac=1)

As @Corey Levinson notes, you have to be careful when you reassign:

df['column'] = df['column'].sample(frac=1).reset_index(drop=True)

In [16]: def shuffle(df, n=1, axis=0):     
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:     

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: shuffle(df)

In [19]: df
Out[19]: 
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):

# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))

# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4


df:    A  B
1  1  1
0  0  0
3  3  3
4  4  4
2  2  2

Then you can use df.reset_index() to reset the index column, if needs to be:

df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df)

outputs:

df:    A  B
0  1  1
1  0  0
2  4  4
3  2  2
4  3  3

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

df.apply(lambda x: x.sample(frac=1).values)

   a  b
0  4  2
1  1  6
2  6  5
3  5  3
4  2  4
5  3  1

You must use .value so that you return a numpy array and not a Series, or else the returned Series will align to the original DataFrame not changing a thing:

df.apply(lambda x: x.sample(frac=1))

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

From the docs use sample():

In [79]: s = pd.Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]: 
0    0
dtype: int64

# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]: 
5    5
2    2
4    4
dtype: int64

# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]: 
5    5
4    4
1    1
dtype: int64