
Python Pandas - Fill down text value in column where following cells are blank

I have a DataFrame and I'm trying to fill down the value in the 'Date' column (which is text), as follows:

The DataFrame is generated using dfs = pd.read_html(pageUrl, infer_types=False), then df = dfs[0].

            Date     Time datetime  Year
    0               None     None  2007
    1     May 1     0:58     None  2007
    2               1:00     None  2007
    3               1:30     None  2007
    4               1:45     None  2007
    5               3:45     None  2007
    6               4:45     None  2007
    7               6:30     None  2007
    8               7:15     None  2007
    9               7:45     None  2007

df.dtypes shows:

    Date        object
    Time        object
    datetime    object
    Year         int64
    dtype: object

First I tried filling on a per-row basis, shifting back one row to get the previous value if the current 'Date' is empty:

    def fillDate(r):
        if r['Date']=="":
            p=r.shift(-1)
            r['Date']=p['Date']
        return r

then

    df.apply(fillDate,axis=1)

This populates the 'Date' column with the 'Time'.

So then I tried applying with axis=0 (per-column basis), modifying the function so that it only touches the 'Date' column (I can't see how to apply this to just one column):

    def fillDate(r):
        if r.name=='Date':
            if r['Date']=="":
                p=r.shift(-1)
                r['Date']=p['Date']
        return r

then

    df.apply(fillDate,axis=0)

gives the error

    KeyError: ('Date', u'occurred at index Date')

The aim is to fill down the value in the 'Date' with the value from the previous cell when the 'Date' is blank.

How can I do this?

zio asked Dec 25 '22 20:12


2 Answers

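# assumes: import numpy as np; import pandas as pd; from io import StringIO
# `data` is assumed to hold the table from the question as fixed-width text

In [16]: df = pd.read_fwf(StringIO(data),widths=[5,12,8,8,6],header=0,names=['idx','date','time','datetime','year'])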

# simulate what the OP actually has (though this doesn't happen upon read in)

In [30]: df['date'] = df['date'].fillna('')

In [31]: df
Out[31]: 
   idx   date  time datetime  year
0    0         None     None  2007
1    1  May 1  0:58     None  2007
2    2         1:00     None  2007
3    3         1:30     None  2007
4    4         1:45     None  2007
5    5         3:45     None  2007
6    6         4:45     None  2007
7    7         6:30     None  2007
8    8         7:15     None  2007
9    9         7:45     None  2007

In [32]: df.loc[df.date=='','date'] = np.nan

In [33]: df
Out[33]: 
   idx   date  time datetime  year
0    0    NaN  None     None  2007
1    1  May 1  0:58     None  2007
2    2    NaN  1:00     None  2007
3    3    NaN  1:30     None  2007
4    4    NaN  1:45     None  2007
5    5    NaN  3:45     None  2007
6    6    NaN  4:45     None  2007
7    7    NaN  6:30     None  2007
8    8    NaN  7:15     None  2007
9    9    NaN  7:45     None  2007

In [34]: df['date']  = df['date'].ffill()

In [35]: df
Out[35]: 
   idx   date  time datetime  year
0    0    NaN  None     None  2007
1    1  May 1  0:58     None  2007
2    2  May 1  1:00     None  2007
3    3  May 1  1:30     None  2007
4    4  May 1  1:45     None  2007
5    5  May 1  3:45     None  2007
6    6  May 1  4:45     None  2007
7    7  May 1  6:30     None  2007
8    8  May 1  7:15     None  2007
9    9  May 1  7:45     None  2007
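
The steps above can be condensed into a single expression. The following is only a minimal sketch, assuming hypothetical sample data with two date groups (the "May 2" rows are invented for illustration) and using Series.mask in place of the .loc assignment shown above:

```python
import pandas as pd

# Hypothetical sample data: blank strings stand in for the missing dates.
df = pd.DataFrame({
    "Date": ["", "May 1", "", "", "May 2", ""],
    "Time": [None, "0:58", "1:00", "1:30", "0:15", "2:00"],
})

# mask() replaces the blank strings with NaN (its default fill value),
# then ffill() carries the last seen date down to the next non-blank row.
df["Date"] = df["Date"].mask(df["Date"] == "").ffill()

print(df)
```

The first row stays NaN because there is no earlier value to carry forward, which matches the output above.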
Jeff answered Dec 28 '22 13:12


If I understand the problem correctly, it should be as easy as:

df['Date'] = df['Date'].ffill(axis=0)

This fills any missing values with the previously available value from the same column. Note that ffill only treats NaN/None as missing, so empty strings have to be converted to NaN first.
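
It is worth checking whether the blanks in the column are really NaN or just empty strings. A minimal sketch, assuming a hypothetical three-row Series named s, of how ffill behaves in each case:

```python
import pandas as pd

# Hypothetical column where the blanks are empty strings, not NaN.
s = pd.Series(["May 1", "", ""], name="Date")

# ffill() alone leaves the blanks untouched: "" is a value, not a missing one.
unchanged = s.ffill()

# Converting "" to NaN first gives ffill() something to fill.
filled = s.mask(s == "").ffill()

print(unchanged.tolist())
print(filled.tolist())
```

If the first list still contains empty strings, the conversion step is needed before the fill.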

Here are some links for understanding the method, including the documentation:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html
https://www.studytonight.com/pandas/pandas-dataframe-ffill-method

Minura Punchihewa answered Dec 28 '22 15:12