I have a dataframe that contains 5 columns and I am using pandas and numpy to edit and work with the data.
id calv1 calv2 calv3 calv4
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29
2 NaT NaT NaT NaT
3 2006-08-29 NaT NaT NaT
4 2006-08-29 2007-08-29 2010-08-29 NaT
5 2006-08-29 2013-08-29 NaT NaT
6 2006-08-29 NaT 2013-08-29 2013-08-292
I want to create another column that counts the number of "calv" dates that occur for each id. However, it matters to me if there are missing values in between other values (see row 6); in that case I want a NaN, or perhaps some other value, indicating that this is not a correct row.
id calv1 calv2 calv3 calv4 no_calv
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4
2 NaT NaT NaT NaT 0
3 2006-08-29 NaT NaT NaT 1
4 2006-08-29 2007-08-29 2010-08-29 NaT 3
5 2006-08-29 2013-08-29 NaT NaT 2
6 2006-08-29 NaT 2013-08-29 2013-08-292 NaN #or some other value
Here is my last attempt:
nat = np.datetime64('NaT')
#0 calvings
df.loc[
    (df["calv1"] == nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 0
#1 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 1
#2 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 2
#3 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] == nat),
    "no_calv"] = 3
#4 or more calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] != nat),
    "no_calv"] = 4
But the result is that the whole "no_calv" column is 4.0
I previously tried things like
..
(df["calv1"] != "NaT")
..
And
..
(df["calv1"] != pd.nat)
..
And the result was always 4.0 for the whole column, or just NaN. I can't seem to find a way of telling Python what the NaT values are?
Any tips and tricks for a new Python user? I've done this both in SAS and in Fortran using if and elseif statements, but I am trying to find the best way to do this in Python.
Edit: I'm really curious to know whether this can be done with if or elif statements.
I'm also now thinking I would like to be able to have other columns in the dataframe that contain extra info but are not needed for this exact purpose. An example (with an added yx column):
id yx calv1 calv2 calv3 calv4 no_calv
1 27 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4
2 34 NaT NaT NaT NaT 0
3 89 2006-08-29 NaT NaT NaT 1
4 23 2006-08-29 2007-08-29 2010-08-29 NaT 3
5 11 2006-08-29 2013-08-29 NaT NaT 2
6 43 2006-08-29 NaT 2013-08-29 2013-08-292 NaN #or some other value
Another way of doing it, using pd.Series.last_valid_index and pd.DataFrame.count:
>>> df2 = df.copy()
>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
calv1 calv2 calv3 calv4 no_calv
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4.0
2 NaN NaN NaN NaN 0.0
3 2006-08-29 NaN NaN NaN 1.0
4 2006-08-29 2007-08-29 2010-08-29 NaN 3.0
5 2006-08-29 2013-08-29 NaN NaN 2.0
6 2006-08-29 NaN 2013-08-29 2013-08-292 NaN
pd.Series.last_valid_index returns the position of the last valid data point in a Series. Applying it to your rows tells you the column position where the last valid data sits (after which there are only NaNs/NaTs).
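For a quick illustration (toy values, not the question's data): last_valid_index returns the index label of the last non-missing entry, and None when the whole Series is missing, which is why it is paired with .fillna(0) in this answer:
>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([1.0, 2.0, np.nan, np.nan]).last_valid_index()
1
>>> print(pd.Series([np.nan, np.nan]).last_valid_index())
None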
Below I temporarily replaced the column names with integer indices and then applied pd.Series.last_valid_index to each row:
>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> df2
1 2 3 4
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29
2 NaN NaN NaN NaN
3 2006-08-29 NaN NaN NaN
4 2006-08-29 2007-08-29 2010-08-29 NaN
5 2006-08-29 2013-08-29 NaN NaN
6 2006-08-29 NaN 2013-08-29 2013-08-292
>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0)
id
1 4.0
2 0.0
3 1.0
4 3.0
5 2.0
6 4.0
dtype: float64
So on row 1, last valid data is in column 4, on row 2 there is no valid data, and so on.
Now let's count the number of valid values in each row:
>>> df2.count(axis=1)
id
1 4
2 0
3 1
4 3
5 2
6 3
dtype: int64
So on row 1 there are 4 valid values, on row 2 no valid values, and so on. Now, if all NaN/NaT values are towards the end of the row, the counts should match the last valid data positions we calculated above:
>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1)
id
1 True
2 True
3 True
4 True
5 True
6 False
dtype: bool
So as seen, it matches on all rows except the last, because a NaT appears in the middle of valid values in the last row. We can use this as a mask and then fill in the sum:
>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
calv1 calv2 calv3 calv4 no_calv
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4.0
2 NaN NaN NaN NaN 0.0
3 2006-08-29 NaN NaN NaN 1.0
4 2006-08-29 2007-08-29 2010-08-29 NaN 3.0
5 2006-08-29 2013-08-29 NaN NaN 2.0
6 2006-08-29 NaN 2013-08-29 2013-08-292 NaN
You can try the following, with df.interpolate:
>>> # convert to something other than datetime
>>> numeric = df.apply(lambda col: col.dt.day, axis=1)
>>> numeric
calv1 calv2 calv3 calv4
id
1 29.0 29.0 29.0 29.0
2 NaN NaN NaN NaN
3 29.0 NaN NaN NaN
4 29.0 29.0 29.0 NaN
5 29.0 29.0 NaN NaN
6 29.0 NaN 29.0 29.0
>>> mask = (
...     numeric.isna() != numeric.interpolate(limit_area='inside', axis=1).isna()
... ).any(1)
>>> mask
id
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
>>> df.loc[~mask, 'no_calv'] = df.notna().sum(1)
# Or,
# df['no_calv'] = np.where(mask, np.nan, df.notna().sum(1))
>>> df
calv1 calv2 calv3 calv4 no_calv
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4.0
2 NaT NaT NaT NaT 0.0
3 2006-08-29 NaT NaT NaT 1.0
4 2006-08-29 2007-08-29 2010-08-29 NaT 3.0
5 2006-08-29 2013-08-29 NaT NaT 2.0
6 2006-08-29 NaT 2013-08-29 2013-08-29 NaN
What interpolate(limit_area='inside') does is fill NaNs only where there are valid values on either side.
For example:
>>> numeric
calv1 calv2 calv3 calv4
id
1 29.0 29.0 29.0 29.0
2 NaN NaN NaN NaN
3 29.0 NaN NaN NaN
4 29.0 29.0 29.0 NaN
5 29.0 29.0 NaN NaN
6 29.0 NaN 29.0 29.0
>>> numeric.interpolate(limit_area='inside', axis=1)
calv1 calv2 calv3 calv4
id
1 29.0 29.0 29.0 29.0
2 NaN NaN NaN NaN
3 29.0 NaN NaN NaN
4 29.0 29.0 29.0 NaN
5 29.0 29.0 NaN NaN
6 29.0 29.0 29.0 29.0
(Only row 6 is filled in, because its missing value sits between valid values.)
So if we compare which NaN values from numeric do not match the interpolated numeric, we can find the rows that have NaN values in between valid values.
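As a minimal sketch on a single toy row (values made up for illustration), the comparison flags exactly the position that interpolate(limit_area='inside') was able to fill, i.e. an interior gap:
>>> row = pd.Series([29.0, np.nan, 29.0, np.nan])
>>> row.isna() != row.interpolate(limit_area='inside').isna()
0    False
1     True
2    False
3    False
dtype: bool
The trailing NaN at position 3 is not filled, so it does not trip the mask; only the interior NaN at position 1 does.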
To test if a value is NaT, use pd.isnull as shown in this answer. isnull matches None, NaN, and NaT.
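A quick check (assuming the usual import pandas as pd and import numpy as np) of why pd.isnull works where equality comparisons do not: NaT, like NaN, is never equal to itself.
>>> pd.isnull(pd.NaT), pd.isnull(np.nan), pd.isnull(None)
(True, True, True)
>>> pd.NaT == pd.NaT
False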
You can build a function which does this check and counts the values until it hits a null value. For example:
import io
import numpy as np
import pandas as pd
df = pd.read_fwf(io.StringIO("""calv1 calv2 calv3 calv4
2006-08-29 2007-08-29 2008-08-29 2009-08-29
NaT NaT NaT NaT
2006-08-29 NaT NaT NaT
2006-08-29 2007-08-29 2010-08-29 NaT
2006-08-29 2013-08-29 NaT NaT
2006-08-29 NaT 2013-08-29 2013-08-292"""))
df = df.replace("NaT", pd.NaT)
def count_non_nat(row):
    count = 0
    for i in row:
        if pd.isnull(i):
            # Hit a null: if we counted fewer values than the row actually
            # contains, a valid value follows the gap, so flag the row
            if count < len(row.dropna()):
                return np.nan
            return count
        count += 1
    return count
# Apply this function row-wise (axis=1)
df['count'] = df.apply(count_non_nat, axis=1)
The output is a new column:
calv1 calv2 calv3 calv4 count
0 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4
1 NaT NaT NaT NaT 0
2 2006-08-29 NaT NaT NaT 1
3 2006-08-29 2007-08-29 2010-08-29 NaT 3
4 2006-08-29 2013-08-29 NaT NaT 2
5 2006-08-29 NaT 2013-08-29 2013-08-292 NaN