Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill missing rows in a python pandas dataframe with repetitive pattern

I am trying to fix missing rows in a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame([[1, 1.2, 3.4], [2, 4.5, 6.7], [3, 1.3, 2.5], [4, 5.6, 7.3],
               [1, 3.4, 5.8], [2, 5.7, 8.9], [4, 2.4, 2.6], [1, 6.7, 8.4],
               [3, 6.9, 4.2], [4, 4.2, 1.2]], columns = ['#', 'foo', 'bar'])

The above code give me a pandas dataframe like this:

Out[10]: 
   #  foo  bar
0  1  1.2  3.4
1  2  4.5  6.7
2  3  1.3  2.5
3  4  5.6  7.3
4  1  3.4  5.8
5  2  5.7  8.9
6  4  2.4  2.6
7  1  6.7  8.4
8  3  6.9  4.2
9  4  4.2  1.2

As you probably noticed, the values in the '#' column are in a repetitive pattern as 1, 2, 3, 4, 1, 2, 3, 4 ... but with some missing values (for this instance, 3 before row 6 and 2 before row 8). My question is: Is there any built in method (function) in pandas to fill the missing rows in this dataframe according to the repetitive pattern of '#' column? The values in the other columns of the filling rows can be NaN, or the interpolation\extrapolation\average of the values before and\or after the filling rows. In the other words, what I want is like this:

Out[16]: 
    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

I tried to set the '#' column as the index of the dataframe and reindex it with regular pattern without missing values. But the problem is the pd.reindex doesn't work with duplicate values. I know I can always go traditional way by iterating in a loop from line to line to fix it but I am afraid this would be time consuming if working with large size data.

I would appreciate if anyone can give me a hint on this.

like image 413
B.J. Avatar asked Jul 18 '19 12:07

B.J.


People also ask

How do you repeat a series on pandas?

Pandas Series: repeat() function The repeat() function is used to repeat elements of a Series. Returns a new Series where each element of the current Series is repeated consecutively a given number of times. The number of repetitions for each element. This should be a non-negative integer.

How do you backfill in pandas?

bfill() is used to backward fill the missing values in the dataset. It will backward fill the NaN values that are present in the pandas dataframe.

How do you use Iterrows in pandas?

Pandas DataFrame iterrows() MethodThe iterrows() method generates an iterator object of the DataFrame, allowing us to iterate each row in the DataFrame. Each iteration produces an index object and a row object (a Pandas Series object).


2 Answers

You need create groups some way - here is used difference of values # and comparing with >1 by Series.le, then is used GroupBy.apply with Series.reindex:

df1 = (df.groupby(df['#'].diff().lt(1).cumsum())
         .apply(lambda x: x.set_index('#').reindex(range(1, 5)))
         .reset_index(level=0, drop=True)
         .reset_index())

print (df1)
    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

Another idea is create MultiIndex and reshape by unstack and stack:

df = (df.set_index(['#', df['#'].diff().lt(1).cumsum()])
       .unstack()
       .reindex(np.arange(4)+1)
       .stack(dropna=False)
       .sort_index(level=1)
       .reset_index(level=1, drop=True)
       .reset_index())
print (df)
    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2
like image 86
jezrael Avatar answered Apr 27 '23 21:04

jezrael


We can mark each group of 1,2,3,4 with eq and cumsum.

Then we groupby on these groups and use reindex and finally concat them back together.

s = df['#'].eq(4).shift().cumsum().bfill()

pd.concat(
    [d.set_index('#').reindex(np.arange(4)+1) for _, d in df.groupby(s)]
).reset_index()

Output

    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

Note: if you would have a 4 as missing value in your # column, this method would fail.

like image 21
Erfan Avatar answered Apr 27 '23 22:04

Erfan