How to fill in the missing records of a pandas DataFrame in a Pythonic way?

Tags: python, pandas

I have a pandas DataFrame 'df' like this:

         X   Y  
IX1 IX2
A   A1  20  30
    A2  20  30
    A5  20  30
B   B2  20  30
    B4  20  30

It is missing some rows, and I want to fill in the gaps in the middle like this:

         X   Y  
IX1 IX2
A   A1  20  30
    A2  20  30
    A3  NaN NaN
    A4  NaN NaN
    A5  20  30
B   B2  20  30
    B3  NaN NaN
    B4  20  30

Is there a Pythonic way to do this?

bigbug asked Sep 12 '12 14:09


1 Answer

You need to construct the full index and then use the DataFrame's reindex method, like so:

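import pandas
from io import StringIO  # Python 3 location; the Python 2 StringIO module is gone

# Sample data with gaps: A3, A4, B1 and B3 are absent
datastring = StringIO("""\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30""")

# Use the first two columns as a two-level row index
dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])
# Every (C1, C2) pair that should appear in the result
full_index = [('A', 'A1'), ('A', 'A2'), ('A', 'A3'),
              ('A', 'A4'), ('A', 'A5'), ('B', 'B1'),
              ('B', 'B2'), ('B', 'B3'), ('B', 'B4')]
# Pairs absent from the original are added as rows of NaN
new_df = dataframe.reindex(full_index)
new_df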
      C3  C4
A A1  20  30
  A2  20  30
  A3 NaN NaN
  A4 NaN NaN
  A5  20  30
B B1 NaN NaN
  B2  20  30
  B3 NaN NaN
  B4  20  30

And then you can use the fillna method to set the NaNs to whatever you want.
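As a rough illustration of that last step, here is a minimal sketch of two common ways to fill the NaNs after reindexing. The names csv_text, reindexed, filled_const and filled_ffill are made up for this example, the fill value 0 is arbitrary, and the forward fill within each C1 group is just one possible choice, not something the question asked for.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the answer above
csv_text = """\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30"""

df = pd.read_csv(StringIO(csv_text), index_col=["C1", "C2"])

# Full index covering only the interior gaps (A3, A4, B3)
full_index = pd.MultiIndex.from_tuples(
    [("A", "A1"), ("A", "A2"), ("A", "A3"), ("A", "A4"), ("A", "A5"),
     ("B", "B2"), ("B", "B3"), ("B", "B4")],
    names=["C1", "C2"],
)
reindexed = df.reindex(full_index)

# Option 1: replace every NaN with a constant (0 is an arbitrary choice)
filled_const = reindexed.fillna(0)

# Option 2: carry the last observed row forward, separately for each C1 group
filled_ffill = reindexed.groupby(level="C1").ffill()
```

Grouping by the first index level before forward filling keeps values from group A from leaking into group B when a group starts with a missing row.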

Update (June 2014)

Just had to revisit this myself... Current versions of pandas provide MultiIndex.from_product, which builds a MultiIndex from the Cartesian product of iterables. So the above solution could become:

import pandas
from io import StringIO

# Same data, but with an integer second level
datastring = StringIO("""\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30""")

dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])
# Cartesian product of both levels; range(1, 6) yields 1 through 5
full_index = pandas.MultiIndex.from_product([('A', 'B'), range(1, 6)], names=['C1', 'C2'])
new_df = dataframe.reindex(full_index)
new_df
      C3  C4
C1 C2
 A  1  20  30
    2  20  30
    3 NaN NaN
    4 NaN NaN
    5  20  30
 B  1 NaN NaN
    2  20  30
    3 NaN NaN
    4  20  30
    5 NaN NaN

Pretty elegant, in my opinion.
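Note that a Cartesian product gives every C1 value the same set of C2 values, whereas the output requested in the question only fills the gaps between each group's first and last entry. A possible sketch of that variant follows, assuming the second level is an integer as in the update above; the names csv_text, tuples, gap_index and gap_filled are hypothetical.

```python
from io import StringIO

import pandas as pd

csv_text = """\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30"""

df = pd.read_csv(StringIO(csv_text), index_col=["C1", "C2"])

# For each C1 group, list every integer between the smallest and largest
# observed C2 value, so only interior gaps are added.
tuples = []
for key, c2_values in df.index.to_frame(index=False).groupby("C1")["C2"]:
    tuples.extend((key, c2) for c2 in range(c2_values.min(), c2_values.max() + 1))

gap_index = pd.MultiIndex.from_tuples(tuples, names=["C1", "C2"])
gap_filled = df.reindex(gap_index)
```

With the sample data this produces rows 1 through 5 for A and 2 through 4 for B, with NaN in the rows that were absent from the input.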

Paul H answered Sep 28 '22 17:09