Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to randomly split a DataFrame into several smaller DataFrames?

I'm having trouble randomly splitting DataFrame df into groups of smaller DataFrames.

df
  movie_id  1   2   4   5   6   7   8   9   10  11  12  borda
0   1       5   4   0   4   4   0   0   0   4   0   0   21
1   2       3   0   0   3   0   0   0   0   0   0   0   6   
2   3       4   0   0   0   0   0   0   0   0   0   0   4   
3   4       3   0   0   0   0   5   0   0   4   0   5   17  
4   5       3   0   0   0   0   0   0   0   0   0   0   3   
5   6       5   0   0   0   0   0   0   5   0   0   0   10  
6   7       4   0   0   0   2   5   3   4   4   0   0   22  
7   8       1   0   0   0   4   5   0   0   0   4   0   14  
8   9       5   0   0   0   4   5   0   0   4   5   0   23  
9   10      3   2   0   0   0   4   0   0   0   0   0   9   
10  11      2   0   4   0   0   3   3   0   4   2   0   18  
11  12      5   0   0   0   4   5   0   0   5   2   0   21  
12  13      5   4   0   0   2   0   0   0   3   0   0   14  
13  14      5   4   0   0   5   0   0   0   0   0   0   14  
14  15      5   0   0   0   3   0   0   0   0   5   5   18  
15  16      5   0   0   0   0   0   0   0   4   0   0   9   
16  17      3   0   0   4   0   0   0   0   0   0   0   7   
17  18      4   0   0   0   0   0   0   0   0   0   0   4   
18  19      5   3   0   0   4   0   0   0   0   0   0   12  
19  20      4   0   0   0   0   0   0   0   0   0   0   4   
20  21      1   0   0   3   3   0   0   0   0   0   0   7   
21  22      4   0   0   0   3   5   5   0   5   4   0   26  
22  23      4   0   0   0   4   3   0   0   5   0   0   16  
23  24      3   0   0   4   0   0   0   0   0   3   0   10  

I've tried sample and arange, but with bad results.

ran1 = df.sample(frac=0.2, replace=False, random_state=1)
ran2 = df.sample(frac=0.2, replace=False, random_state=1)
ran3 = df.sample(frac=0.2, replace=False, random_state=1)
ran4 = df.sample(frac=0.2, replace=False, random_state=1)
ran5 = df.sample(frac=0.2, replace=False, random_state=1)

print(ran1, '\n')
print(ran2, '\n')
print(ran3, '\n')
print(ran4, '\n')
print(ran5, '\n')

This turned out to be 5 exact same DataFrames.

   movie_id  1  2  4  5  6  7  8  9  10  11  12  borda  
13    14     5  4  0  0  5  0  0  0   0   0   0     14  
18    19     5  3  0  0  4  0  0  0   0   0   0     12  
3     4      3  0  0  0  0  5  0  0   4   0   5     17  
14    15     5  0  0  0  3  0  0  0   0   5   5     18  
20    21     1  0  0  3  3  0  0  0   0   0   0      7  

Also I've tried :

g = df.groupby(['movie_id'])
h = np.arange(g.ngroups)
np.random.shuffle(h)

df[g.ngroup().isin(h[:6])]

The output :

    movie_id    1   2   4   5   6   7   8   9   10  11  12  borda   
4      5        3   0   0   0   0   0   0   0   0   0   0   3   
6      7        4   0   0   0   2   5   3   4   4   0   0   22  
7      8        1   0   0   0   4   5   0   0   0   4   0   14  
16     17       3   0   0   4   0   0   0   0   0   0   0   7   
17     18       4   0   0   0   0   0   0   0   0   0   0   4   
18     19       5   3   0   0   4   0   0   0   0   0   0   12  

But there's still only one smaller group, other datas from df aren't grouped.

I'm expecting the smaller groups to be split evenly by using percentage. And the whole df should be split into groups.

like image 436
Jerry Chen Avatar asked Feb 17 '19 05:02

Jerry Chen


People also ask

Can you slice a DataFrame?

Slicing a DataFrame in Pandas includes the following steps: Ensure Python is installed (or install ActivePython) Import a dataset. Create a DataFrame. Slice the DataFrame.


2 Answers

Use np.array_split

shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)  

df.sample(frac=1) shuffle the rows of df. Then use np.array_split split it into parts that have equal size.

It gives you:

for part in result:
    print(part,'\n')
    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
5          6  5  0  0  0  0  0  0  5   0   0   0     10
4          5  3  0  0  0  0  0  0  0   0   0   0      3
7          8  1  0  0  0  4  5  0  0   0   4   0     14
16        17  3  0  0  4  0  0  0  0   0   0   0      7
22        23  4  0  0  0  4  3  0  0   5   0   0     16 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13        14  5  4  0  0  5  0  0  0   0   0   0     14
14        15  5  0  0  0  3  0  0  0   0   5   5     18
21        22  4  0  0  0  3  5  5  0   5   4   0     26
1          2  3  0  0  3  0  0  0  0   0   0   0      6
20        21  1  0  0  3  3  0  0  0   0   0   0      7 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
10        11  2  0  4  0  0  3  3  0   4   2   0     18
9         10  3  2  0  0  0  4  0  0   0   0   0      9
11        12  5  0  0  0  4  5  0  0   5   2   0     21
8          9  5  0  0  0  4  5  0  0   4   5   0     23
12        13  5  4  0  0  2  0  0  0   3   0   0     14 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
18        19  5  3  0  0  4  0  0  0   0   0   0     12
3          4  3  0  0  0  0  5  0  0   4   0   5     17
0          1  5  4  0  4  4  0  0  0   4   0   0     21
23        24  3  0  0  4  0  0  0  0   0   3   0     10
6          7  4  0  0  0  2  5  3  4   4   0   0     22 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
17        18  4  0  0  0  0  0  0  0   0   0   0      4
2          3  4  0  0  0  0  0  0  0   0   0   0      4
15        16  5  0  0  0  0  0  0  0   4   0   0      9
19        20  4  0  0  0  0  0  0  0   0   0   0      4 
like image 63
Dawei Avatar answered Oct 21 '22 13:10

Dawei


A simple demo:

df = pd.DataFrame({"movie_id": np.arange(1, 25),
          "borda": np.random.randint(1, 25, size=(24,))})
n_split = 5
# the indices used to select parts from dataframe
ixs = np.arange(df.shape[0])
np.random.shuffle(ixs)
# np.split cannot work when there is no equal division
# so we need to find out the split points ourself
# we need (n_split-1) split points
split_points = [i*df.shape[0]//n_split for i in range(1, n_split)]
# use these indices to select the part we want
for ix in np.split(ixs, split_points):
    print(df.iloc[ix])

The result:

    borda  movie_id
8       3         9
10      2        11
22     14        23
7      14         8

    borda  movie_id
0      16         1
20      4        21
17     15        18
15      1        16
6       6         7

    borda  movie_id
9       9        10
19      4        20
5       1         6
16     23        17
21     20        22

    borda  movie_id
11     24        12
23      5        24
1      22         2
12      7        13
18     15        19

    borda  movie_id
3      11         4
14     10        15
2       6         3
4       7         5
13     21        14
like image 32
keineahnung2345 Avatar answered Oct 21 '22 11:10

keineahnung2345