random sampling with Pandas data frame disjoint groups

Question

I need to randomly separate a data frame into two disjoint sets by the attribute 'ids'. For example, consider the following data frame:

df=
Out[470]: 
          0     1     2     3       ids
0      17.0  18.0  16.0  15.0      13.0
1      18.0  16.0  15.0  15.0      13.0
2      16.0  15.0  15.0  16.0      13.0
131    12.0   8.0  21.0  19.0      14.0
132     8.0  21.0  19.0  20.0      14.0
133    21.0  19.0  20.0   9.0      14.0
248     NaN   NaN  12.0  11.0      17.0
249     NaN  12.0  11.0  10.0      17.0
250    12.0  11.0  10.0   NaN      17.0
287     3.0   3.0   1.0   8.0      20.0
288     3.0   1.0   8.0   3.0      20.0
289     1.0   8.0   3.0   3.0      20.0
413    21.0   7.0  16.0  18.0      25.0
414     7.0  16.0  18.0  19.0      25.0
415    16.0  18.0  19.0  18.0      25.0
665    10.0   8.0   8.0   7.0      27.0
666     8.0   8.0   7.0   9.0      27.0
667     8.0   7.0   9.0   8.0      27.0
790     NaN   NaN  15.0   NaN      33.0
791     NaN  15.0   NaN  10.0      33.0
792    15.0   NaN  10.0   NaN      33.0
812     NaN  16.0   NaN  17.0      34.0
813    16.0   NaN  17.0   NaN      34.0
814     NaN  17.0   NaN  13.0      34.0
944     3.0   4.0   3.0  18.0      35.0
945     4.0   3.0  18.0  18.0      35.0
946     3.0  18.0  18.0  11.0      35.0
1059    9.0  10.0   3.0   4.0      56.0
1060   10.0   3.0   4.0   3.0      56.0
1061    3.0   4.0   3.0   3.0      56.0
    ...   ...   ...   ...       ...
10125   NaN   9.0   5.0   5.0  101317.0
10126   9.0   5.0   5.0   5.0  101317.0
10127   5.0   5.0   5.0   7.0  101317.0

I need two data frames, split at random with a given fraction, such that no value of ids appears in both.

I know how to solve it in a 'non-pandasian' way:

1. get the unique values of the ids
2. randomly split the unique values into two disjoint groups
3. select rows according to the values of ids in the two groups using .isin()
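The three steps above can be sketched as follows. This is only an illustration: the helper name split_by_ids, its frac and seed parameters, and the toy frame demo are assumptions made for the example, not part of the original question.

```python
import pandas as pd

def split_by_ids(df, frac=0.5, seed=None):
    """Split df into two frames that share no value of the 'ids' column.

    Hypothetical helper: frac is the fraction of distinct ids (not rows)
    that goes into the first frame.
    """
    # Step 1: the unique values of ids.
    unique_ids = df["ids"].drop_duplicates()
    # Step 2: randomly pick a fraction of the unique ids for the first group.
    first_ids = unique_ids.sample(frac=frac, random_state=seed)
    # Step 3: select rows by membership of their id in the first group.
    in_first = df["ids"].isin(first_ids)
    return df[in_first], df[~in_first]

# Toy data: four ids with three rows each.
demo = pd.DataFrame({
    "value": range(12),
    "ids": [13] * 3 + [14] * 3 + [17] * 3 + [20] * 3,
})
part1, part2 = split_by_ids(demo, frac=0.5, seed=0)
```

Because the fraction applies to ids rather than rows, the row counts of the two frames match frac only when the groups are of similar size.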

I am wondering whether there is a simple and neat way to do it with some pandas built-in function, like .sample()?

root · Accepted Answer

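Use sklearn.model_selection.GroupShuffleSplit to perform the split. Note that test_size is the fraction of distinct ids (groups), not of rows, that goes into the second set: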

from sklearn.model_selection import GroupShuffleSplit

# Initialize the GroupShuffleSplit.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5)

# Get the indexers for the split.
idx1, idx2 = next(gss.split(df, groups=df.ids))

# Get the split DataFrames.
df1, df2 = df.iloc[idx1], df.iloc[idx2]
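To check that the two frames really are disjoint, a small self-contained run may help. The toy frame demo, its non-default index, and the fixed random_state are assumptions added for reproducibility; they are not part of the answer above.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data with a non-default index, to show why .iloc is the right accessor:
# split() yields positional indices, not index labels.
demo = pd.DataFrame(
    {"value": range(12), "ids": [13] * 3 + [14] * 3 + [17] * 3 + [20] * 3},
    index=range(100, 112),
)

gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
pos1, pos2 = next(gss.split(demo, groups=demo["ids"]))
part1, part2 = demo.iloc[pos1], demo.iloc[pos2]

# Ids present in both frames; an empty set means the split is disjoint.
shared = set(part1["ids"]) & set(part2["ids"])
print(shared)
```

With four ids and test_size=0.5, two ids end up in each frame, and every row lands in exactly one of them.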

MaxU - stop WAR against UA · Answer

UPDATE:

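# Note: even ids go to df1 and odd ids to df2, so only the row order is random,
# not the assignment of ids to the two sets. Assumes a unique index.
df1 = df.sample(frac=1).loc[df.ids % 2 == 0]
df2 = df.loc[df.index.difference(df1.index)]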

OLD, incorrect answer (it does not keep the ids separate):

You can first shuffle your DF using sample(frac=1) and then use np.split():

import numpy as np
df1, df2 = np.split(df.sample(frac=1), 2)
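A small check illustrates why a purely row-wise split does not satisfy the question. The toy frame below is an assumption chosen so that the outcome does not depend on the shuffle: with three ids of four rows each, any half of six rows must cut through at least one id. Plain .iloc slicing stands in for np.split here.

```python
import pandas as pd

# Three ids with four rows each: 12 rows in total.
demo = pd.DataFrame({
    "value": range(12),
    "ids": [13] * 4 + [14] * 4 + [17] * 4,
})

# Shuffle the rows, then cut the shuffled frame in half by position.
shuffled = demo.sample(frac=1, random_state=0)
half = len(shuffled) // 2
part1, part2 = shuffled.iloc[:half], shuffled.iloc[half:]

# Ids that occur in both halves; never empty for this data,
# because six rows cannot be made up of whole groups of four.
shared = set(part1["ids"]) & set(part2["ids"])
print(shared)
```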

Tags: python, pandas, disjoint-sets

Arnold Klein
