I am trying to fix missing rows in a pandas DataFrame like this: <pre class="prettyprint"><code>import pandas as pd df = pd.DataFrame([[1, 1.2, 3.4], [2, 4.5, 6.7], [3, 1.3, 2.5], [4, 5.6, 7.3], [1, 3.4, 5.8], [2, 5.7, 8.9], [4, 2.4, 2.6], [1, 6.7, 8.4], [3, 6.9, 4.2], [4, 4.2, 1.2]], columns = ['#', 'foo', 'bar']) </code></pre> The above code give me a pandas dataframe like this: <pre class="prettyprint"><code>Out[10]: # foo bar 0 1 1.2 3.4 1 2 4.5 6.7 2 3 1.3 2.5 3 4 5.6 7.3 4 1 3.4 5.8 5 2 5.7 8.9 6 4 2.4 2.6 7 1 6.7 8.4 8 3 6.9 4.2 9 4 4.2 1.2 </code></pre> As you probably noticed, the values in the <code>'#'</code> column are in a repetitive pattern as <code>1, 2, 3, 4, 1, 2, 3, 4 ...</code> but with some missing values (for this instance, <code>3</code> before row 6 and <code>2</code> before row 8). My question is: Is there any built in method (function) in pandas to fill the missing rows in this dataframe according to the repetitive pattern of <code>'#'</code> column? The values in the other columns of the filling rows can be <code>NaN</code>, or the interpolation\extrapolation\average of the values before and\or after the filling rows. In the other words, what I want is like this: <pre class="prettyprint"><code>Out[16]: # foo bar 0 1 1.2 3.4 1 2 4.5 6.7 2 3 1.3 2.5 3 4 5.6 7.3 4 1 3.4 5.8 5 2 5.7 8.9 6 3 NaN NaN 7 4 2.4 2.6 8 1 6.7 8.4 9 2 NaN NaN 10 3 6.9 4.2 11 4 4.2 1.2 </code></pre> I tried to set the <code>'#'</code> column as the index of the dataframe and reindex it with regular pattern without missing values. But the problem is the <code>pd.reindex</code> doesn't work with duplicate values. I know I can always go traditional way by iterating in a loop from line to line to fix it but I am afraid this would be time consuming if working with large size data. I would appreciate if anyone can give me a hint on this.

We can mark each group of <code>1,2,3,4</code> with <code>eq</code> and <code>cumsum</code>. Then we <code>groupby</code> on these groups and use <code>reindex</code> and finally <code>concat</code> them back together. <pre class="prettyprint"><code>s = df['#'].eq(4).shift().cumsum().bfill() pd.concat( [d.set_index('#').reindex(np.arange(4)+1) for _, d in df.groupby(s)] ).reset_index() </code></pre> Output <pre class="prettyprint"><code> # foo bar 0 1 1.2 3.4 1 2 4.5 6.7 2 3 1.3 2.5 3 4 5.6 7.3 4 1 3.4 5.8 5 2 5.7 8.9 6 3 NaN NaN 7 4 2.4 2.6 8 1 6.7 8.4 9 2 NaN NaN 10 3 6.9 4.2 11 4 4.2 1.2 </code></pre> <hr> Note: if you would have a <code>4</code> as missing value in your <code>#</code> column, this method would fail.

Fill missing rows in a python pandas dataframe with repetitive pattern

I am trying to fix missing rows in a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame([[1, 1.2, 3.4], [2, 4.5, 6.7], [3, 1.3, 2.5], [4, 5.6, 7.3],
               [1, 3.4, 5.8], [2, 5.7, 8.9], [4, 2.4, 2.6], [1, 6.7, 8.4],
               [3, 6.9, 4.2], [4, 4.2, 1.2]], columns = ['#', 'foo', 'bar'])

The above code give me a pandas dataframe like this:

Out[10]: 
   #  foo  bar
0  1  1.2  3.4
1  2  4.5  6.7
2  3  1.3  2.5
3  4  5.6  7.3
4  1  3.4  5.8
5  2  5.7  8.9
6  4  2.4  2.6
7  1  6.7  8.4
8  3  6.9  4.2
9  4  4.2  1.2

As you probably noticed, the values in the '#' column are in a repetitive pattern as 1, 2, 3, 4, 1, 2, 3, 4 ... but with some missing values (for this instance, 3 before row 6 and 2 before row 8). My question is: Is there any built in method (function) in pandas to fill the missing rows in this dataframe according to the repetitive pattern of '#' column? The values in the other columns of the filling rows can be NaN, or the interpolation\extrapolation\average of the values before and\or after the filling rows. In the other words, what I want is like this:

Out[16]: 
    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

I tried to set the '#' column as the index of the dataframe and reindex it with regular pattern without missing values. But the problem is the pd.reindex doesn't work with duplicate values. I know I can always go traditional way by iterating in a loop from line to line to fix it but I am afraid this would be time consuming if working with large size data.

I would appreciate if anyone can give me a hint on this.

How do you repeat a series on pandas?

Pandas Series: repeat() function The repeat() function is used to repeat elements of a Series. Returns a new Series where each element of the current Series is repeated consecutively a given number of times. The number of repetitions for each element. This should be a non-negative integer.

How do you backfill in pandas?

bfill() is used to backward fill the missing values in the dataset. It will backward fill the NaN values that are present in the pandas dataframe.

How do you use Iterrows in pandas?

Pandas DataFrame iterrows() MethodThe iterrows() method generates an iterator object of the DataFrame, allowing us to iterate each row in the DataFrame. Each iteration produces an index object and a row object (a Pandas Series object).

You need create groups some way - here is used difference of values # and comparing with >1 by Series.le, then is used GroupBy.apply with Series.reindex:

df1 = (df.groupby(df['#'].diff().lt(1).cumsum())
         .apply(lambda x: x.set_index('#').reindex(range(1, 5)))
         .reset_index(level=0, drop=True)
         .reset_index())

print (df1)
    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

Another idea is create MultiIndex and reshape by unstack and stack:

df = (df.set_index(['#', df['#'].diff().lt(1).cumsum()])
       .unstack()
       .reindex(np.arange(4)+1)
       .stack(dropna=False)
       .sort_index(level=1)
       .reset_index(level=1, drop=True)
       .reset_index())
print (df)
    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

We can mark each group of 1,2,3,4 with eq and cumsum.

Then we groupby on these groups and use reindex and finally concat them back together.

s = df['#'].eq(4).shift().cumsum().bfill()

pd.concat(
    [d.set_index('#').reindex(np.arange(4)+1) for _, d in df.groupby(s)]
).reset_index()

Output

    #  foo  bar
0   1  1.2  3.4
1   2  4.5  6.7
2   3  1.3  2.5
3   4  5.6  7.3
4   1  3.4  5.8
5   2  5.7  8.9
6   3  NaN  NaN
7   4  2.4  2.6
8   1  6.7  8.4
9   2  NaN  NaN
10  3  6.9  4.2
11  4  4.2  1.2

Note: if you would have a 4 as missing value in your # column, this method would fail.

Fill missing rows in a python pandas dataframe with repetitive pattern

Tags:

python-3.x

pandas

dataframe

B.J.

People also ask

2 Answers

jezrael

Erfan

Recent Activity

Donate For Us

Fill missing rows in a python pandas dataframe with repetitive pattern

Tags:

python-3.x

pandas

dataframe

B.J.

People also ask

2 Answers

jezrael

Erfan

Related questions

Recent Activity

Donate For Us