I am trying to fix missing rows in a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame([[1, 1.2, 3.4], [2, 4.5, 6.7], [3, 1.3, 2.5], [4, 5.6, 7.3],
[1, 3.4, 5.8], [2, 5.7, 8.9], [4, 2.4, 2.6], [1, 6.7, 8.4],
[3, 6.9, 4.2], [4, 4.2, 1.2]], columns = ['#', 'foo', 'bar'])
The above code give me a pandas dataframe like this:
Out[10]:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 4 2.4 2.6
7 1 6.7 8.4
8 3 6.9 4.2
9 4 4.2 1.2
As you probably noticed, the values in the '#'
column are in a repetitive pattern as 1, 2, 3, 4, 1, 2, 3, 4 ...
but with some missing values (for this instance, 3
before row 6 and 2
before row 8). My question is: Is there any built in method (function) in pandas to fill the missing rows in this dataframe according to the repetitive pattern of '#'
column? The values in the other columns of the filling rows can be NaN
, or the interpolation\extrapolation\average of the values before and\or after the filling rows. In the other words, what I want is like this:
Out[16]:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
I tried to set the '#'
column as the index of the dataframe and reindex it with regular pattern without missing values. But the problem is the pd.reindex
doesn't work with duplicate values. I know I can always go traditional way by iterating in a loop from line to line to fix it but I am afraid this would be time consuming if working with large size data.
I would appreciate if anyone can give me a hint on this.
Pandas Series: repeat() function The repeat() function is used to repeat elements of a Series. Returns a new Series where each element of the current Series is repeated consecutively a given number of times. The number of repetitions for each element. This should be a non-negative integer.
bfill() is used to backward fill the missing values in the dataset. It will backward fill the NaN values that are present in the pandas dataframe.
Pandas DataFrame iterrows() MethodThe iterrows() method generates an iterator object of the DataFrame, allowing us to iterate each row in the DataFrame. Each iteration produces an index object and a row object (a Pandas Series object).
You need create groups some way - here is used difference of values #
and comparing with >1
by Series.le
, then is used GroupBy.apply
with Series.reindex
:
df1 = (df.groupby(df['#'].diff().lt(1).cumsum())
.apply(lambda x: x.set_index('#').reindex(range(1, 5)))
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
Another idea is create MultiIndex
and reshape by unstack
and stack
:
df = (df.set_index(['#', df['#'].diff().lt(1).cumsum()])
.unstack()
.reindex(np.arange(4)+1)
.stack(dropna=False)
.sort_index(level=1)
.reset_index(level=1, drop=True)
.reset_index())
print (df)
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
We can mark each group of 1,2,3,4
with eq
and cumsum
.
Then we groupby
on these groups and use reindex
and finally concat
them back together.
s = df['#'].eq(4).shift().cumsum().bfill()
pd.concat(
[d.set_index('#').reindex(np.arange(4)+1) for _, d in df.groupby(s)]
).reset_index()
Output
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
Note: if you would have a 4
as missing value in your #
column, this method would fail.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With