
Pandas: Insert missing row data and iterate with conditions within groups

I have a dataframe and need to insert missing row data. Here is the dataframe:

import pandas as pd

df = pd.DataFrame({
    'name': ['Jim', 'Jim', 'Jim', 'Jim', 'Mike', 'Mike', 'Mike', 'Mike', 'Mike',
             'Polo', 'Polo', 'Polo', 'Polo', 'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [80, 68, 751, 'Started', 32, 68, 126, 49, 'Started', 105, 68, 76, 'Started', 251, 49, 23, 'Started'],
    'To_num': [99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 114, 76, 96, 115, 49, 23],
})
    name From_num  To_num
0    Jim       80      99
1    Jim       68      80
2    Jim      751      68
3    Jim  Started     751
4   Mike       32     105
5   Mike       68      32
6   Mike      126      68
7   Mike       49     126
8   Mike  Started      49
9   Polo      105     324
10  Polo       68     105
11  Polo       76     114 #Missing record between line 10 and 11
12  Polo  Started      76
13   Tom      251      96
14   Tom       49     115 # Missing record between 13 and 14
15   Tom       23      49
16   Tom  Started      23

The records for each group (person's name) form a continuous chain from 'From_num' to 'To_num', read from bottom to top. For example, Jim: 'Started' -> 751, 751 -> 68, 68 -> 80, 80 -> 99. Mike follows the same pattern. But some records are missing for Polo and Tom. For Polo, I wish to insert a row between lines 10 and 11, 114 -> 68, to make the whole record continuous. Likewise for Tom, insert a row between lines 13 and 14: 115 -> 251. I tried to code this with loops and conditions and failed, so please help if you have any ideas. Please DO NOT insert those missing records by hand, as this is only a simplified example. Many thanks for the help! Hopefully the question is clear. The expected result is below:

df_expected:
    name From_num  To_num
0    Jim       80      99
1    Jim       68      80
2    Jim      751      68
3    Jim  Started     751
4   Mike       32     105
5   Mike       68      32
6   Mike      126      68
7   Mike       49     126
8   Mike  Started      49
9   Polo      105     324
10  Polo       68     105
11  Polo      114      68 # New Inserted line
12  Polo       76     114
13  Polo  Started      76
14   Tom      251      96
15   Tom      115     251 # New Inserted line
16   Tom       49     115
17   Tom       23      49
18   Tom  Started      23
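
The continuity rule described in the question can be written as a small check, which may be handy for validating any candidate solution. This is only a sketch: the helper name `is_continuous` is made up for illustration, and it assumes that 'Started' appears only on the last row of each name and that the rows of a name are adjacent.

```python
import pandas as pd


def is_continuous(frame: pd.DataFrame) -> bool:
    """Return True if, within every name, each row's From_num equals
    the To_num of the row directly below it."""
    # To_num of the next row inside the same group (NaN on each group's last row).
    next_to = frame.groupby('name', sort=False)['To_num'].shift(-1)
    has_next = next_to.notna()
    # Compare only rows that have a successor; the 'Started' rows are skipped.
    current_from = frame.loc[has_next, 'From_num'].astype('int64')
    return bool((current_from == next_to[has_next].astype('int64')).all())


# Jim's records from the question form an unbroken chain ...
jim = pd.DataFrame({'name': ['Jim'] * 4,
                    'From_num': [80, 68, 751, 'Started'],
                    'To_num': [99, 80, 68, 751]})
# ... while Tom's are missing the 115 -> 251 step.
tom = pd.DataFrame({'name': ['Tom'] * 4,
                    'From_num': [251, 49, 23, 'Started'],
                    'To_num': [96, 115, 49, 23]})

print(is_continuous(jim))  # True
print(is_continuous(tom))  # False
```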
JimmyXX Lumix asked Jul 05 '20 23:07


2 Answers

We can come up with this. The idea is to use shift to find the rows that do not match the previous row within each group, then add the missing rows to the original df:

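# From_num of the previous row within each name (NaN on the first row of a group)
s = df.groupby('name', sort=False)['From_num'].shift()
# keep the rows whose To_num does not match the previous row's From_num
addingdata = pd.concat([s, df.drop(columns='From_num')], axis=1)[df['To_num'].ne(s) & s.notnull()]
# reuse the label of the row above so the new row sorts between the two
addingdata.index -= 1
addingdata.columns = ['To_num', 'name', 'From_num']
# DataFrame.append was removed in pandas 2.0; a stable sort keeps original rows before new ones
df = pd.concat([df, addingdata]).sort_index(kind='mergesort')
df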
    name From_num To_num
0    Jim       80     99
1    Jim       68     80
2    Jim      751     68
3    Jim  Started    751
4   Mike       32    105
5   Mike       68     32
6   Mike      126     68
7   Mike       49    126
8   Mike  Started     49
9   Polo      105    324
10  Polo       68    105
10  Polo      114     68
11  Polo       76    114
12  Polo  Started     76
13   Tom      251     96
13   Tom      115    251
14   Tom       49    115
15   Tom       23     49
16   Tom  Started     23
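
To see what the shift step is doing, here is a small sketch restricted to Polo's rows from the question. The names `polo`, `prev_from` and `gap` are illustrative only and do not appear in the answer above.

```python
import pandas as pd

# Polo's rows from the question, keeping the original index labels 9..12.
polo = pd.DataFrame({'name': ['Polo'] * 4,
                     'From_num': [105, 68, 76, 'Started'],
                     'To_num': [324, 105, 114, 76]},
                    index=[9, 10, 11, 12])

# From_num of the row above, within the same name (NaN on the group's first row).
prev_from = polo.groupby('name', sort=False)['From_num'].shift()

# A gap sits above every row whose To_num differs from the previous row's From_num.
gap = polo['To_num'].ne(prev_from) & prev_from.notna()

# Show the shifted column and the mask next to the data.
print(polo.assign(prev_from=prev_from, gap=gap))
print(gap[gap].index.tolist())  # [11]
```

Row 11 is flagged because its To_num (114) differs from the From_num of row 10 (68), which is exactly the pair the new row 114 -> 68 has to bridge.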
BENY answered Nov 15 '22 09:11


We can do the following:

  1. Check whether To_num of the next row equals From_num of the current row
  2. Do this check per group of name
  3. For the rows where it does not, make a copy and replace To_num by From_num
  4. Finally, fill From_num of each copy with To_num of the next row

This solution should be fast, since it is vectorized apart from one step: the boolean mask is computed per group with GroupBy.apply, which is an acceptable use of apply.

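def create_masks(d):
    # To_num of the next row within this group (NaN on the last row)
    shift = d['To_num'].shift(-1)
    m1 = d['From_num'].ne(shift)
    m2 = shift.notna()

    return m1 & m2


def create_rows(d):
    # sort=False keeps the groups in row order, so the mask lines up after reset_index
    bools = (d.groupby('name', sort=False)[['From_num', 'To_num']]
              .apply(create_masks).reset_index(drop=True))
    vals = d[bools].copy()
    vals['To_num'] = vals['From_num']
    vals.loc[:, 'From_num'] = d.loc[bools.shift(fill_value=False), 'To_num'].to_numpy()
    # DataFrame.append was removed in pandas 2.0; a stable sort keeps each original row before its copy
    d = pd.concat([d, vals]).sort_index(kind='mergesort').reset_index(drop=True)

    return d

df = create_rows(df)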

Output


    name From_num To_num
0    Jim       80     99
1    Jim       68     80
2    Jim      751     68
3    Jim  Started    751
4   Mike       32    105
5   Mike       68     32
6   Mike      126     68
7   Mike       49    126
8   Mike  Started     49
9   Polo      105    324
10  Polo       68    105
11  Polo      114     68
12  Polo       76    114
13  Polo  Started     76
14   Tom      251     96
15   Tom      115    251
16   Tom       49    115
17   Tom       23     49
18   Tom  Started     23
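
If avoiding GroupBy.apply is preferred, the same four steps can probably be expressed with a grouped shift instead. The following is a sketch rather than the answer's own code: `insert_missing_rows` is a hypothetical name, and it assumes the frame has a default RangeIndex and that all rows of a name are adjacent.

```python
import pandas as pd


def insert_missing_rows(d: pd.DataFrame) -> pd.DataFrame:
    """Insert the rows needed to make each name's From_num -> To_num chain continuous."""
    # To_num of the next row within the same name; NaN on the last row of each group.
    next_to = d.groupby('name', sort=False)['To_num'].shift(-1)
    # A gap follows a row whose From_num differs from the next row's To_num.
    gap = d['From_num'].ne(next_to) & next_to.notna()

    # Each new row links the next row's To_num to the current row's From_num.
    new_rows = pd.DataFrame({
        'name': d.loc[gap, 'name'],
        'From_num': next_to[gap].astype('int64'),
        'To_num': d.loc[gap, 'From_num'],
    })
    # New rows reuse the index label of the row they follow; a stable sort
    # keeps each original row ahead of its inserted neighbour.
    out = pd.concat([d, new_rows]).sort_index(kind='mergesort')
    return out.reset_index(drop=True)


# Polo's and Tom's rows from the question, the two groups with a gap.
df = pd.DataFrame({
    'name': ['Polo', 'Polo', 'Polo', 'Polo', 'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [105, 68, 76, 'Started', 251, 49, 23, 'Started'],
    'To_num': [324, 105, 114, 76, 96, 115, 49, 23],
})
result = insert_missing_rows(df)
print(result)
```

The grouped shift never looks across a name boundary, so no per-group function call is needed.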
Erfan answered Nov 15 '22 08:11