I have a dataframe and I'm trying to fill down the value in the 'Date' column (which is text), as follows.
The dataframe is generated using dfs=pd.read_html(pageUrl,infer_types=False)
then df=dfs[0]
    Date  Time datetime  Year
0         None     None  2007
1  May 1  0:58     None  2007
2         1:00     None  2007
3         1:30     None  2007
4         1:45     None  2007
5         3:45     None  2007
6         4:45     None  2007
7         6:30     None  2007
8         7:15     None  2007
9         7:45     None  2007
df.dtypes
shows:
Date        object
Time        object
datetime    object
Year         int64
dtype: object
First I tried filling on a per-row basis, shifting back one row to get the previous value when the current 'Date' is empty:
def fillDate(r):
    if r['Date']=="":
        p=r.shift(-1)
        r['Date']=p['Date']
    return r
then
df.apply(fillDate,axis=1)
This populates the 'Date' column with the 'Time'.
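A minimal sketch of why the row-wise attempt picks up the 'Time' value, assuming a single hand-built row that mimics what apply passes to the function with axis=1 (the names `row` and `shifted` are illustrative, not from the original code). Inside the function, `r` is one row indexed by the column labels, so shifting it moves values across columns rather than between rows:

```python
import pandas as pd

# One row of the frame, as apply(..., axis=1) would hand it to the function.
row = pd.Series({"Date": "", "Time": "1:00", "datetime": None, "Year": 2007})

# shift(-1) moves every value one position along the row's own index
# (the column labels), so the 'Date' slot receives what was in 'Time'.
shifted = row.shift(-1)
print(shifted["Date"])
```

The function never sees the neighbouring rows, which is why a column-wise operation on df['Date'] is the better fit here.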
So then I tried applying with axis=0 (on a per-column basis) and modifying the function so that it only touches the 'Date' column (I can't see how to apply a function to just one column):
def fillDate(r):
    if r.name=='Date':
        if r['Date']=="":
            p=r.shift(-1)
            r['Date']=p['Date']
    return r
then
df.apply(fillDate,axis=0)
gives the error
KeyError: ('Date', u'occurred at index Date')
The aim is to fill down the 'Date' column, using the value from the previous row whenever 'Date' is blank.
How can I do this?
In [16]: df = pd.read_fwf(StringIO(data),widths=[5,12,8,8,6],header=0,names=['idx','date','time','datetime','year'])
# simulate what the OP actually has (though this doesn't happen upon read in)
In [30]: df['date'] = df['date'].fillna('')
In [31]: df
Out[31]:
   idx   date  time datetime  year
0    0         None     None  2007
1    1  May 1  0:58     None  2007
2    2         1:00     None  2007
3    3         1:30     None  2007
4    4         1:45     None  2007
5    5         3:45     None  2007
6    6         4:45     None  2007
7    7         6:30     None  2007
8    8         7:15     None  2007
9    9         7:45     None  2007
In [32]: df.loc[df.date=='','date'] = np.nan
In [33]: df
Out[33]:
   idx   date  time datetime  year
0    0    NaN  None     None  2007
1    1  May 1  0:58     None  2007
2    2    NaN  1:00     None  2007
3    3    NaN  1:30     None  2007
4    4    NaN  1:45     None  2007
5    5    NaN  3:45     None  2007
6    6    NaN  4:45     None  2007
7    7    NaN  6:30     None  2007
8    8    NaN  7:15     None  2007
9    9    NaN  7:45     None  2007
In [34]: df['date'] = df['date'].ffill()
In [35]: df
Out[35]:
   idx   date  time datetime  year
0    0    NaN  None     None  2007
1    1  May 1  0:58     None  2007
2    2  May 1  1:00     None  2007
3    3  May 1  1:30     None  2007
4    4  May 1  1:45     None  2007
5    5  May 1  3:45     None  2007
6    6  May 1  4:45     None  2007
7    7  May 1  6:30     None  2007
8    8  May 1  7:15     None  2007
9    9  May 1  7:45     None  2007
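The two steps above (blank to NaN, then forward fill) can also be chained. This is a sketch on a small hand-built frame that only approximates the question's data, using the question's column name 'Date'; `Series.replace` is swapped in here for the `.loc` assignment shown in the session:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the scraped table: blanks are empty strings.
df = pd.DataFrame({
    "Date": ["", "May 1", "", "", ""],
    "Time": [None, "0:58", "1:00", "1:30", "1:45"],
    "Year": [2007] * 5,
})

# Turn empty strings into NaN so they count as missing, then fill downwards.
df["Date"] = df["Date"].replace("", np.nan).ffill()
print(df)
```

The first row stays missing because there is no earlier value to carry forward, matching row 0 in the output above.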
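If I am understanding the problem correctly, it should be as easy as:
df['Date'] = df['Date'].ffill()
This fills each missing value (NaN or None) with the last valid value above it in the same column. Empty strings do not count as missing, so if the blanks are '' they have to be converted to NaN first.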
Here are some links that can be used to understand the method, including the documentation, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html https://www.studytonight.com/pandas/pandas-dataframe-ffill-method
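A small check of what ffill treats as missing, assuming a toy Series made up for illustration (the names `s` and `filled` are not from the question). It suggests why a text column full of empty strings can look untouched after ffill:

```python
import pandas as pd

# None is missing; the empty string is an ordinary value.
s = pd.Series([None, "May 1", "", None], dtype=object)

filled = s.ffill()

# The empty string at position 2 is kept as-is, and it is that empty
# string (not "May 1") that gets carried into position 3.
print(filled.tolist())
```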