
How to use df.loc (or some other method) to make a new column based on specific conditions?

I have a dataframe that contains 5 columns and I am using pandas and numpy to edit and work with the data.

id      calv1      calv2      calv3      calv4 
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29
2         NaT        NaT        NaT        NaT         
3  2006-08-29        NaT        NaT        NaT
4  2006-08-29 2007-08-29 2010-08-29        NaT
5  2006-08-29 2013-08-29        NaT        NaT
6  2006-08-29        NaT 2013-08-29 2013-08-292

I want to create another column that counts the number of "calv" values that occur for each id. However, it matters to me if there are missing values in between other values (see row 6). In that case I want a NaN, or perhaps some other value, indicating this is not a correct row.

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2
6  2006-08-29        NaT 2013-08-29 2013-08-292     NaN    #or some other value

Here is my last attempt:

nat = np.datetime64('NaT')

# 0 calvings
df.loc[
    (df["calv1"] == nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 0
# 1 calving
df.loc[
    (df["calv1"] != nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 1
# 2 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 2
# 3 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] == nat),
    "no_calv"] = 3
# 4 or more calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] != nat),
    "no_calv"] = 4

But the result is that the whole "no_calv" column is 4.0

I previously tried things like

..
(df["calv1"] != "NaT")
..

And

..
(df["calv1"] != pd.nat)
..

And the result was always 4.0 for the whole column, or just NaN. I can't seem to find a way of telling Python what the NaT values are.
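(Note: NaT follows NaN-style comparison semantics, so a test like == nat is always False and != nat is always True, even for missing values; that is why the final condition above matches every row and the whole column ends up as 4.0. The reliable check in pandas is pd.isna / Series.isna. A minimal illustration:)

import numpy as np
import pandas as pd

nat = np.datetime64("NaT")
nat == nat         # False: NaT is never equal to anything, itself included
nat != nat         # True:  so "!= nat" matches every value, NaT or not
pd.isna(pd.NaT)    # True:  pd.isna / df["calv1"].isna() is the reliable test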

Any tips and tricks for a new Python user? I've done this both in SAS and in Fortran using if and elseif statements, but I am trying to find the best way to do this in Python.

Edit: I'm really curious to know whether this can be done with if or else-if statements.

I'm also thinking I would like the dataframe to be able to contain other columns with extra info that are not needed for this exact purpose. An example (with an added yx column):

id yx       calv1      calv2      calv3      calv4 no_calv
1  27  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2  34         NaT        NaT        NaT        NaT       0 
3  89  2006-08-29        NaT        NaT        NaT       1
4  23  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  11  2006-08-29 2013-08-29        NaT        NaT       2
6  43  2006-08-29        NaT 2013-08-29 2013-08-292     NaN    #or some other value
Asked by Thordis, Jun 10 '21





3 Answers

Another way of doing it using pd.Series.last_valid_index and pd.DataFrame.count:

>>> df2  = df.copy()
>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
         calv1       calv2       calv3        calv4  no_calv
id                                                          
1   2006-08-29  2007-08-29  2008-08-29   2009-08-29      4.0
2          NaN         NaN         NaN          NaN      0.0
3   2006-08-29         NaN         NaN          NaN      1.0
4   2006-08-29  2007-08-29  2010-08-29          NaN      3.0
5   2006-08-29  2013-08-29         NaN          NaN      2.0
6   2006-08-29         NaN  2013-08-29  2013-08-292      NaN

Explanation:

pd.Series.last_valid_index returns the index label of the last valid (non-null) value in a series, or None if there is none. Applying it along your rows tells you the column position of the last valid value in each row (after which there are only NaNs/NaTs).
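For instance, on a small standalone series (illustrative data, not taken from the question):

>>> s = pd.Series([pd.Timestamp('2006-08-29'), pd.NaT, pd.NaT])
>>> s.last_valid_index()
0
>>> pd.Series([pd.NaT, pd.NaT]).last_valid_index() is None
True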

Below I temporarily replaced the column names with integer indices and then applied pd.Series.last_valid_index on each row:

>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> df2
             1           2           3            4
id                                                 
1   2006-08-29  2007-08-29  2008-08-29   2009-08-29
2          NaN         NaN         NaN          NaN
3   2006-08-29         NaN         NaN          NaN
4   2006-08-29  2007-08-29  2010-08-29          NaN
5   2006-08-29  2013-08-29         NaN          NaN
6   2006-08-29         NaN  2013-08-29  2013-08-292

>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0)
id
1    4.0
2    0.0
3    1.0
4    3.0
5    2.0
6    4.0
dtype: float64

So on row 1 the last valid data is in column 4, on row 2 there is no valid data, and so on.

Now let's count the number of valid values in each row:

>>> df2.count(axis=1)
id
1    4
2    0
3    1
4    3
5    2
6    3
dtype: int64

So on row 1 there are 4 valid values, on row 2 there are none, and so on. Now, if all NaN/NaT values sit towards the end of a row, the count should match the last-valid-data position we calculated above:

>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1)
id
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

As you can see, they match on all rows except the last, because a NaT appears in the middle of valid values in that row. We can use this as a mask and then fill in the row-wise counts:

>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
         calv1       calv2       calv3        calv4  no_calv
id                                                          
1   2006-08-29  2007-08-29  2008-08-29   2009-08-29      4.0
2          NaN         NaN         NaN          NaN      0.0
3   2006-08-29         NaN         NaN          NaN      1.0
4   2006-08-29  2007-08-29  2010-08-29          NaN      3.0
5   2006-08-29  2013-08-29         NaN          NaN      2.0
6   2006-08-29         NaN  2013-08-29  2013-08-292      NaN
Answered by Ank, Oct 17 '22


You can try the following, with df.interpolate:

>>> # convert the datetimes to plain numbers (here, the day of the month)
>>> numeric = df.apply(lambda row: row.dt.day, axis=1)
>>> numeric

    calv1  calv2  calv3  calv4
id                            
1    29.0   29.0   29.0   29.0
2     NaN    NaN    NaN    NaN
3    29.0    NaN    NaN    NaN
4    29.0   29.0   29.0    NaN
5    29.0   29.0    NaN    NaN
6    29.0    NaN   29.0   29.0

>>> mask = (
        numeric.isna() != numeric.interpolate(limit_area='inside', axis=1).isna()
    ).any(1)
>>> mask
id
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

>>> df.loc[~mask, 'no_calv'] = df.notna().sum(1)
# Or,
# df['no_calv'] = np.where(mask, np.nan, df.notna().sum(1))
>>> df

        calv1      calv2      calv3      calv4  no_calv
id                                                     
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29      4.0
2         NaT        NaT        NaT        NaT      0.0
3  2006-08-29        NaT        NaT        NaT      1.0
4  2006-08-29 2007-08-29 2010-08-29        NaT      3.0
5  2006-08-29 2013-08-29        NaT        NaT      2.0
6  2006-08-29        NaT 2013-08-29 2013-08-29      NaN

What interpolate(limit_area='inside') does is fill NaNs only when there are valid values on both sides. For example:

>>> numeric
    calv1  calv2  calv3  calv4
id                            
1    29.0   29.0   29.0   29.0
2     NaN    NaN    NaN    NaN
3    29.0    NaN    NaN    NaN
4    29.0   29.0   29.0    NaN
5    29.0   29.0    NaN    NaN
6    29.0    NaN   29.0   29.0

>>> numeric.interpolate(limit_area='inside', axis=1)
    calv1  calv2  calv3  calv4
id                            
1    29.0   29.0   29.0   29.0
2     NaN    NaN    NaN    NaN
3    29.0    NaN    NaN    NaN
4    29.0   29.0   29.0    NaN
5    29.0   29.0    NaN    NaN
6    29.0   29.0   29.0   29.0
             ^
   Only this one is filled

So by comparing which NaN values in numeric do not match the NaNs in the interpolated version, we can find the rows that have NaN values in between valid values.

Answered by Sayandip Dutta, Oct 17 '22


To test if a value is NaT, use pd.isnull as shown in this answer. isnull matches None, NaN, and NaT.
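A quick sanity check of that behaviour:

>>> import pandas as pd, numpy as np
>>> pd.isnull(pd.NaT)
True
>>> pd.isnull(np.nan)
True
>>> pd.isnull(None)
True
>>> pd.isnull(pd.Timestamp('2006-08-29'))
False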

You can build a function which applies this check and counts values until it hits a null. For example:

import io
import numpy as np
import pandas as pd
df = pd.read_fwf(io.StringIO("""calv1      calv2      calv3      calv4 
2006-08-29 2007-08-29 2008-08-29 2009-08-29
       NaT        NaT        NaT        NaT         
2006-08-29        NaT        NaT        NaT
2006-08-29 2007-08-29 2010-08-29        NaT
2006-08-29 2013-08-29        NaT        NaT
2006-08-29        NaT 2013-08-29 2013-08-292"""))
df = df.replace("NaT", pd.NaT)

def count_non_nat(row):
    # Count valid values from the left until the first null.
    count = 0
    for i in row:
        if pd.isnull(i):
            # Hit a null, but there are still valid values further right:
            # the row has a gap, so flag it with NaN.
            if count < len(row.dropna()):
                return np.nan
            # Otherwise everything after this point is null, so the count is final.
            return count
        count += 1
    return count

# Apply this function row-wise (axis=1)
df['count'] = df.apply(count_non_nat, axis=1)

The output is a new column:

  calv1      calv2      calv3      calv4       count
0 2006-08-29 2007-08-29 2008-08-29 2009-08-29  4
1 NaT        NaT        NaT        NaT         0
2 2006-08-29 NaT        NaT        NaT         1
3 2006-08-29 2007-08-29 2010-08-29 NaT         3
4 2006-08-29 2013-08-29 NaT        NaT         2
5 2006-08-29 NaT        2013-08-29 2013-08-292 NaN
Answered by SNygard, Oct 17 '22