I have a dataframe and need to insert missing row data. Here is the dataframe:
import pandas as pd

df = pd.DataFrame({
    'name': ['Jim', 'Jim', 'Jim', 'Jim', 'Mike', 'Mike', 'Mike', 'Mike', 'Mike',
             'Polo', 'Polo', 'Polo', 'Polo', 'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [80, 68, 751, 'Started', 32, 68, 126, 49, 'Started',
                 105, 68, 76, 'Started', 251, 49, 23, 'Started'],
    'To_num': [99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 114, 76, 96, 115, 49, 23],
})
name From_num To_num
0 Jim 80 99
1 Jim 68 80
2 Jim 751 68
3 Jim Started 751
4 Mike 32 105
5 Mike 68 32
6 Mike 126 68
7 Mike 49 126
8 Mike Started 49
9 Polo 105 324
10 Polo 68 105
11 Polo 76 114 #Missing record between line 10 and 11
12 Polo Started 76
13 Tom 251 96
14 Tom 49 115 # Missing record between 13 and 14
15 Tom 23 49
16 Tom Started 23
The records for each group (person's name) are continuous from 'From_num' to 'To_num' across rows, read from bottom to top. For example, Jim: 'Started' -> 751, 751 -> 68, 68 -> 80, 80 -> 99. Mike follows the same pattern. But some records are missing for Polo and Tom: I wish to insert a row between lines 10 and 11, 114 -> 68, so that the whole record is continuous. Likewise for Tom, insert a row between lines 13 and 14: 115 -> 251. I tried to code this with loop conditions and failed, so please help if you have any ideas. Please DO NOT insert those missing records directly, as this is only a simple example. Many thanks for the help! Hopefully the question is clear. The expected result is below:
df_expected:
name From_num To_num
0 Jim 80 99
1 Jim 68 80
2 Jim 751 68
3 Jim Started 751
4 Mike 32 105
5 Mike 68 32
6 Mike 126 68
7 Mike 49 126
8 Mike Started 49
9 Polo 105 324
10 Polo 68 105
11 Polo 114 68 # New Inserted line
12 Polo 76 114
13 Polo Started 76
14 Tom 251 96
15 Tom 115 251 # New Inserted line
16 Tom 49 115
17 Tom 23 49
18 Tom Started 23
One way to do this uses shift: find the rows whose To_num does not match the previous row's From_num within the same name, build the missing rows from them, and add those rows to the original df.
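# previous row's From_num within each name (NaN on the first row of a group)
s = df.groupby('name', sort=False).From_num.shift()
# keep rows whose To_num does not match it; columns= is passed by keyword (positional axis is not accepted in pandas 2.0+)
addingdata = pd.concat([s, df.drop(columns='From_num')], axis=1)[df.To_num.ne(s) & s.notnull()]
addingdata.index -= 1  # each new row takes the index of the row above the gap
addingdata.columns = ['To_num', 'name', 'From_num']
# DataFrame.append was removed in pandas 2.0; a stable sort keeps each original row before the inserted one
df = pd.concat([df, addingdata]).sort_index(kind='mergesort')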
df
name From_num To_num
0 Jim 80 99
1 Jim 68 80
2 Jim 751 68
3 Jim Started 751
4 Mike 32 105
5 Mike 68 32
6 Mike 126 68
7 Mike 49 126
8 Mike Started 49
9 Polo 105 324
10 Polo 68 105
10 Polo 114 68
11 Polo 76 114
12 Polo Started 76
13 Tom 251 96
13 Tom 115 251
14 Tom 49 115
15 Tom 23 49
16 Tom Started 23
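To see why the shift lines up the values to compare, here is a minimal sketch on a toy frame holding only the Polo rows from the question. The names `toy`, `prev_from` and `gap` are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Hypothetical toy frame: only the Polo group from the question.
toy = pd.DataFrame({
    'name': ['Polo', 'Polo', 'Polo', 'Polo'],
    'From_num': [105, 68, 76, 'Started'],
    'To_num': [324, 105, 114, 76],
})

# Previous row's From_num within each name; the first row of a group gets NaN.
prev_from = toy.groupby('name', sort=False)['From_num'].shift()

# A gap sits where To_num differs from the previous row's From_num.
gap = toy['To_num'].ne(prev_from) & prev_from.notna()

print(toy.assign(prev_from=prev_from, gap=gap))
```

Only the third row (76 -> 114) is flagged, because the row above it starts at 68 rather than 114. Note that in the result above the inserted rows share an index label with the row before them; `df.reset_index(drop=True)` renumbers the rows 0 to 18 as in df_expected.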
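We can do the following:
1. Check for each row whether the To_num of the next row is equal to the current row's From_num.
2. Do this check per group of name, so rows of different people are never compared.
3. Copy the rows that fail the check and replace their To_num by From_num.
4. Fill the To_num of the next row in From_num.
This solution should be fast, since it is all vectorized, except that we have to check the booleans for each group with GroupBy.apply, but that is an okay scenario to use apply.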
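def create_masks(d):
    # next row's To_num within the group; NaN on the last row
    shift = d['To_num'].shift(-1)
    m1 = d['From_num'].ne(shift)
    m2 = shift.notna()
    return m1 & m2

def create_rows(d):
    # sort=False keeps the groups in their original order, so the mask lines up after reset_index
    bools = d.groupby('name', sort=False).apply(create_masks).reset_index(drop=True)
    vals = d[bools].copy()
    vals['To_num'] = vals['From_num']
    vals.loc[:, 'From_num'] = d.loc[bools.shift(fill_value=False), 'To_num'].to_numpy()
    # DataFrame.append was removed in pandas 2.0; the stable sort keeps each new row after its original
    d = pd.concat([d, vals]).sort_index(kind='mergesort').reset_index(drop=True)
    return d

df = create_rows(df)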
Output
name From_num To_num
0 Jim 80 99
1 Jim 68 80
2 Jim 751 68
3 Jim Started 751
4 Mike 32 105
5 Mike 68 32
6 Mike 126 68
7 Mike 49 126
8 Mike Started 49
9 Polo 105 324
10 Polo 68 105
11 Polo 114 68
12 Polo 76 114
13 Polo Started 76
14 Tom 251 96
15 Tom 115 251
16 Tom 49 115
17 Tom 23 49
18 Tom Started 23
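Either result can be sanity-checked against the rule from the question: within each name, every row's From_num should equal the To_num of the row below it. The following is a minimal sketch of such a check; the helper `is_continuous` and the frames `gappy` and `fixed` are hypothetical names introduced here, using only the Polo rows as sample data.

```python
import pandas as pd

def is_continuous(frame):
    """Return True if, within each name, every From_num equals the next row's To_num."""
    nxt = frame.groupby('name', sort=False)['To_num'].shift(-1)
    # The last row of each group has no next row (NaN), so it is skipped.
    broken = nxt.notna() & frame['From_num'].ne(nxt)
    return not bool(broken.any())

# Hypothetical sample: Polo before and after the missing row is inserted.
gappy = pd.DataFrame({
    'name': ['Polo'] * 4,
    'From_num': [105, 68, 76, 'Started'],
    'To_num': [324, 105, 114, 76],
})
fixed = pd.DataFrame({
    'name': ['Polo'] * 5,
    'From_num': [105, 68, 114, 76, 'Started'],
    'To_num': [324, 105, 68, 114, 76],
})

print(is_continuous(gappy), is_continuous(fixed))
```

A check like this only detects single missing links of the kind shown in the question; it says nothing about how a longer gap should be filled.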