How do I split a dataframe into multiple dataframes where each dataframe contains equal but random data [duplicate]

Tags:

pandas

How do I split a dataframe into multiple dataframes where each dataframe contains equal but random data? It is not based on a specific column.

For instance, I have a dataframe with 100 rows and 30 columns. I want to divide this data into 5 lots. Each resulting dataframe should have 20 records with the same 30 columns, with no duplication across the 5 lots, and the rows should be picked at random. I don't want the random picking to be based on a single column.

One approach I thought of is to use the index and numpy to divide the rows into lots and then use that to split the dataframe. I wanted to see if someone has an easy, pandas way of doing it.

Anil K asked May 17 '17 17:05


2 Answers

If you do not mind the new dataframes potentially sharing some rows, you could use sample, where frac specifies the fraction of the dataframe to return:

df1 = df.sample(frac=0.5) # df1 is now a random sample of half the dataframe

EDIT:

If you want to avoid duplicates, you can shuffle the rows with shuffle from sklearn and then slice the result into non-overlapping pieces:

from sklearn.utils import shuffle

df = shuffle(df)
df1 = df[0:3]
df2 = df[3:6]
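
The same shuffle-then-slice idea extends to the five equal lots described in the question. What follows is only a minimal sketch, not the method from this answer: it swaps in pandas' own `sample(frac=1)` for the sklearn shuffle, and the example frame, `n_lots`, and `random_state=0` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 100 rows, 30 columns.
df = pd.DataFrame(np.arange(3000).reshape(100, 30))

n_lots = 5  # assumed number of lots, taken from the question's example

# frac=1 returns every row exactly once, in random order.
shuffled = df.sample(frac=1, random_state=0)

# Cut the shuffled frame into consecutive, non-overlapping slices.
lot_size = len(shuffled) // n_lots
lots = [shuffled.iloc[i * lot_size:(i + 1) * lot_size] for i in range(n_lots)]
```

Each element of `lots` is a dataframe with the original columns. If the row count is not a multiple of `n_lots`, this sketch leaves the remainder rows out, so that case would need its own handling.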
Patrick Hingston answered Oct 08 '22 23:10


Depending on your need, you could use pandas.DataFrame.sample() to randomly sample your original data frame, df.

df1 = df.sample(n=3) 
df2 = df.sample(n=3)

This gives you two random subsets of 3 rows each. The two calls are independent, so the same row can appear in both.
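
If the subsets must not overlap, one option is to drop the rows already drawn before sampling again. This is a minimal sketch that assumes the index labels are unique; the example frame and the `random_state` value are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with a unique default index.
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

df1 = df.sample(n=3, random_state=1)

# Drop the rows already taken so the second draw cannot repeat them.
remaining = df.drop(df1.index)
df2 = remaining.sample(n=3, random_state=1)
```

Repeating the drop-then-sample step yields further disjoint subsets until the remaining rows run out.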

SimplySnee answered Oct 08 '22 23:10