
Pandas DataFrame: filling missing values in a column

I have a large DataFrame with the following columns:

import pandas as pd 

x = pd.read_csv('age_year.csv')
x.head()

ID  Year    age
22445   1991    
29925   1991    
76165   1991    
223725  1991    16.0
280165  1991    

The Year column has values ranging from 1991 to 2017. Most IDs have an age value for at least some of these years, for example:

x.loc[x['ID'] == 280165].to_clipboard(index = False)

ID  Year    age
280165  1991    
280165  1992    
280165  1993    
280165  1994    
280165  1995    16.0
280165  1996    17.0
280165  1997    18.0
280165  1998    19.0
280165  1999    20.0
280165  2000    21.0
280165  2001    
280165  2002    
280165  2003    
280165  2004    25.0
280165  2005    26.0
280165  2006    27.0
280165  2007    
280165  2008    
280165  2010    31.0
280165  2011    32.0
280165  2012    33.0
280165  2013    34.0
280165  2014    35.0
280165  2015    36.0
280165  2016    37.0
280165  2017    38.0

I want to fill the missing values in the age column for each unique ID based on their existing values. For example, for ID 280165 above, we know they are 29 in 2008, given that they are 31 in 2010 (28 in 2007, 24 in 2003 and so on).
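
To make the arithmetic concrete, here is a minimal sketch for this single ID. It assumes age goes up by exactly one each year, so `age - Year` is the same in every row; the variable names are only for illustration.

```python
# Known data point for ID 280165 (taken from the table above)
known_year, known_age = 2010, 31.0

# age - Year is the same in every row for one person
offset = known_age - known_year

# Fill in some of the years that are blank in the table
missing_years = [2003, 2007, 2008]
filled = {year: year + offset for year in missing_years}
print(filled)  # {2003: 24.0, 2007: 28.0, 2008: 29.0}
```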

How should one fill in these missing age values for many unique IDs across every year? I'm not sure how to do this in a uniform way across the entire DataFrame. The data used as the example in this question can be found here.

asked Aug 14 '20 12:08 by MI MA



1 Answer

Try grouping by ID and rebuilding age from the first known value in each group:

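def get_age(s):
    s = s.copy()  # avoid modifying the group in place
    # Index label of the first row with a known age
    present = s['age'].notna().idxmax()
    # age - Year is constant for one person
    diff = s.at[present, 'age'] - s.at[present, 'Year']
    s['age'] = diff + s['Year']
    return s

# group_keys=False keeps the original row index so the result aligns with x
x['age'] = x.groupby('ID', group_keys=False).apply(get_age)['age']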
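
If `apply` turns out to be slow on a large frame, the same idea can be sketched without a per-group Python function. This is only a sketch under the same assumption (age increases by one per year within an ID); the small frame below is a made-up stand-in for `age_year.csv`, and only the rows for ID 280165 come from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (rows for ID 22445 are hypothetical)
x = pd.DataFrame({
    'ID':   [280165, 280165, 280165, 280165, 22445, 22445],
    'Year': [2007, 2008, 2010, 2011, 1991, 1992],
    'age':  [np.nan, np.nan, 31.0, 32.0, np.nan, np.nan],
})

# age - Year is NaN wherever age is missing
offset = x['age'] - x['Year']

# 'first' takes the first non-null offset within each ID and
# broadcasts it back to every row of that ID
offset = offset.groupby(x['ID']).transform('first')

# Rebuild age for all rows; IDs with no known age at all stay NaN
x['age'] = x['Year'] + offset
print(x)
```

An ID with no age in any year has nothing to anchor on, so its rows remain missing with either approach.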
answered Nov 03 '22 11:11 by Ayoub ZAROU
