How do I split a dataframe into multiple dataframes so that each one contains an equal number of randomly chosen rows? The split is not based on a specific column.
For instance, I have a dataframe with 100 rows and 30 columns. I want to divide this data into 5 lots. Each lot should have 20 records with the same 30 columns, there should be no duplication across the 5 lots, and the rows should be picked at random. I don't want the random picking to be done on a single column.
One way I thought of is to use the index and numpy to divide the rows into lots and use that to split the dataframe. I wanted to see if someone has an easy, pandas way of doing it.
If you do not care about the new dataframes potentially containing some of the same rows, you could use sample, where frac specifies the fraction of the dataframe that you want:
df1 = df.sample(frac=0.5) # df1 is now a random sample of half the dataframe
EDIT:
If you want to avoid duplicates, you can use shuffle from sklearn:
from sklearn.utils import shuffle
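df = shuffle(df) # shuffled copy; rows keep their original index labels
df1 = df[0:3] # first 3 shuffled rows
df2 = df[3:6] # next 3 shuffled rows, no overlap with df1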
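The same shuffle-then-slice idea can be done with pandas and numpy alone and extended to the 5 lots of 20 rows from the question. This is only a minimal sketch: the 100 x 30 frame is made-up stand-in data, and the names n_lots, positions and lots and the seed value are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 100 rows, 30 columns.
df = pd.DataFrame(np.arange(3000).reshape(100, 30))

n_lots = 5

# Randomly permute the row positions 0..len(df)-1 (each position appears once).
rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable
positions = rng.permutation(len(df))

# Cut the permuted positions into n_lots chunks and select those rows.
# Because every position is used exactly once, no row lands in two lots.
lots = [df.iloc[chunk] for chunk in np.array_split(positions, n_lots)]
```

np.array_split also copes with a row count that does not divide evenly; in that case the first few lots get one extra row each.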
Depending on your need, you could use pandas.DataFrame.sample() to randomly sample your original data frame, df.
df1 = df.sample(n=3)
df2 = df.sample(n=3)
gives you two subsets, each with 3 rows. Both have the same number of records and are random, but the two calls draw independently, so the subsets may share rows.
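If the two subsets must not share rows, one option is to remove the first sample from the frame before drawing the second. A small sketch, assuming the index labels are unique (drop works by label); the example frame and the random_state values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a unique default index.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

df1 = df.sample(n=3, random_state=1)

# Drop the rows already picked so the second draw cannot repeat them.
df2 = df.drop(df1.index).sample(n=3, random_state=2)
```

This gets clumsy with many lots, since every draw has to exclude all the earlier ones; shuffling once and slicing is usually simpler in that case.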