I have a pandas DataFrame 'df' like this:
         X   Y
IX1 IX2
A   A1  20  30
    A2  20  30
    A5  20  30
B   B2  20  30
    B4  20  30
Some rows are missing, and I want to fill in the gaps in the middle like this:
           X    Y
IX1 IX2
A   A1    20   30
    A2    20   30
    A3   NaN  NaN
    A4   NaN  NaN
    A5    20   30
B   B2    20   30
    B3   NaN  NaN
    B4    20   30
Is there a Pythonic way to do this?
You can replace blank or empty values with the DataFrame.replace() method. It swaps a specified value for another value, either in specified columns or across the whole DataFrame, and it replaces every occurrence of that value.
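As a minimal sketch of replace(), assuming a small made-up frame in which empty strings stand in for missing entries (the column names and values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: empty strings mark the missing entries.
df = pd.DataFrame({"X": ["20", "", "25"], "Y": ["30", "31", ""]})

# Replace every empty string in the whole frame with NaN.
cleaned = df.replace("", np.nan)

# Restrict the replacement to column X by passing a nested dict.
only_x = df.replace({"X": {"": np.nan}})
```

Here cleaned has NaN in both columns, while only_x leaves the empty string in Y untouched.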
The pandas DataFrame.fillna() method replaces missing (NaN/None) values with a specified value. It returns a new DataFrame unless the inplace parameter is set to True, in which case it modifies the original DataFrame instead.
Missing values can also be imputed using interpolation. The pandas interpolate() method fills missing values using a chosen interpolation method such as 'linear', 'quadratic', or 'polynomial'. The default method is 'linear'.
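A short sketch of the default linear interpolation, using a made-up Series for illustration. Note that methods such as 'quadratic' and 'polynomial' rely on SciPy being installed, so only the default is shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap in the middle.
s = pd.Series([20.0, np.nan, np.nan, 50.0])

# Linear interpolation (the default) spaces the missing values
# evenly between the neighbouring known values.
filled = s.interpolate()
print(filled.tolist())
```

With equally spaced positions, the gap between 20 and 50 is filled with 30 and 40.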
DataFrame methods such as fillna() can replace the missing values, and mean(), median(), and mode() can be called on the DataFrame to compute the replacement values.
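For example, a sketch that fills each column's gaps with that column's mean (the data is made up for illustration; median() could be swapped in the same way):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({"X": [20.0, np.nan, 40.0], "Y": [30.0, 60.0, np.nan]})

# df.mean() returns one value per column; fillna aligns on column
# names, so each NaN receives the mean of its own column.
filled = df.fillna(df.mean())
```

fillna() returns a new frame here, so df itself still contains its NaN values.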
You need to construct the full index and then use the reindex method of the DataFrame, like so:
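import pandas
from io import StringIO  # Python 3 location; the old StringIO module is Python 2 only

datastring = StringIO("""\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30""")
dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])
# Every (C1, C2) pair that should be present, including the missing ones
full_index = [('A', 'A1'), ('A', 'A2'), ('A', 'A3'),
              ('A', 'A4'), ('A', 'A5'), ('B', 'B1'),
              ('B', 'B2'), ('B', 'B3'), ('B', 'B4')]
# Rows absent from the original data come back filled with NaN
new_df = dataframe.reindex(full_index)
new_df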
         C3    C4
C1 C2
A  A1  20.0  30.0
   A2  20.0  30.0
   A3   NaN   NaN
   A4   NaN   NaN
   A5  20.0  30.0
B  B1   NaN   NaN
   B2  20.0  30.0
   B3   NaN   NaN
   B4  20.0  30.0
And then you can use the fillna method to set the NaNs to whatever you want.
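A sketch of that last step, rebuilding the same sample data so it runs on its own. The names zero_filled and direct are illustrative, and the fill value 0 is only an assumption about what "whatever you want" might be:

```python
import pandas as pd
from io import StringIO

csv_text = """\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30"""
dataframe = pd.read_csv(StringIO(csv_text), index_col=["C1", "C2"])

full_index = pd.MultiIndex.from_tuples(
    [("A", "A1"), ("A", "A2"), ("A", "A3"), ("A", "A4"), ("A", "A5"),
     ("B", "B1"), ("B", "B2"), ("B", "B3"), ("B", "B4")],
    names=["C1", "C2"],
)

# Two-step version: reindex introduces NaN, fillna then replaces it.
zero_filled = dataframe.reindex(full_index).fillna(0)

# One-step version: reindex accepts fill_value, which also avoids the
# detour through NaN (and so keeps the integer dtype).
direct = dataframe.reindex(full_index, fill_value=0)
```

Both frames have nine rows; the difference is that the two-step version ends up with float columns because NaN forces a float dtype along the way.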
Just had to revisit this myself...
Current versions of pandas provide MultiIndex.from_product, which builds a MultiIndex from the Cartesian product of iterables. So the above solution could become:
import pandas
from io import StringIO  # Python 3 location of StringIO

datastring = StringIO("""\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30""")
dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])
# range(1, 6) yields 1..5, matching the rows shown in the output
full_index = pandas.MultiIndex.from_product([('A', 'B'), range(1, 6)], names=['C1', 'C2'])
new_df = dataframe.reindex(full_index)
new_df
         C3    C4
C1 C2
A  1   20.0  30.0
   2   20.0  30.0
   3    NaN   NaN
   4    NaN   NaN
   5   20.0  30.0
B  1    NaN   NaN
   2   20.0  30.0
   3    NaN   NaN
   4   20.0  30.0
   5    NaN   NaN
Pretty elegant, in my opinion.
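The Cartesian product also adds rows outside each group's observed range (B 1 and B 5 here), whereas the question only asks for the gaps in the middle. One possible variation, sketched under the assumption that C2 is an integer level, builds the index per group from its smallest to its largest observed value. The names pairs, gap_index, and gaps_filled are illustrative:

```python
import pandas as pd
from io import StringIO

csv_text = """\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30"""
dataframe = pd.read_csv(StringIO(csv_text), index_col=["C1", "C2"])

# For each C1 group, list every integer between the smallest and the
# largest C2 value actually observed in that group.
pairs = []
for c1, group in dataframe.groupby(level="C1"):
    c2_values = group.index.get_level_values("C2")
    for c2 in range(c2_values.min(), c2_values.max() + 1):
        pairs.append((c1, c2))

gap_index = pd.MultiIndex.from_tuples(pairs, names=["C1", "C2"])

# Only the interior gaps (A 3, A 4, B 3) become NaN rows.
gaps_filled = dataframe.reindex(gap_index)
```

For string labels such as A1..A5, the numeric suffix would need to be parsed out first; that step is left out of this sketch.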