I have a dataframe that contains 5 columns and I am using pandas and numpy to edit and work with the data.
id calv1 calv2 calv3 calv4
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29
2 NaT NaT NaT NaT
3 2006-08-29 NaT NaT NaT
4 2006-08-29 2007-08-29 2010-08-29 NaT
5 2006-08-29 2013-08-29 NaT NaT
6 2006-08-29 NaT 2013-08-29 2013-08-292
I want to create another column that counts the number of "calv" dates that occur for each id. However, it matters to me if there are missing values in between other values (see row 6); in that case I want a NaN, or perhaps some other value, indicating that this is not a correct row.
id calv1 calv2 calv3 calv4 no_calv
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4
2 NaT NaT NaT NaT 0
3 2006-08-29 NaT NaT NaT 1
4 2006-08-29 2007-08-29 2010-08-29 NaT 3
5 2006-08-29 2013-08-29 NaT NaT 2
6 2006-08-29 NaT 2013-08-29 2013-08-292 NaN #or some other value
Here is my last attempt:
nat = np.datetime64('NaT')
#0 calvings
df.loc[
    (df["calv1"] == nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 0
#1 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 1
#2 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 2
#3 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] == nat),
    "no_calv"] = 3
#4 or more calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] != nat),
    "no_calv"] = 4
But the result is that the whole "no_calv" column is 4.0
I previously tried things like
..
(df["calv1"] != "NaT")
..
And
..
(df["calv1"] != pd.nat)
..
And the result was always 4.0 for the whole column, or just NaN. I can't seem to find a way of telling Python what the NaT values are?
Any tips and tricks for a new Python user? I've done this both in SAS and in Fortran using if and elseif statements, but I am trying to find the best way to do this in Python.
Edit: I'm really curious to know whether this can be done with if or elif statements.
I'm also now thinking I would like to be able to have other columns in the dataframe that contain extra info but are not needed for this exact purpose. An example (with an added yx column):
id yx calv1 calv2 calv3 calv4 no_calv
1 27 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4
2 34 NaT NaT NaT NaT 0
3 89 2006-08-29 NaT NaT NaT 1
4 23 2006-08-29 2007-08-29 2010-08-29 NaT 3
5 11 2006-08-29 2013-08-29 NaT NaT 2
6 43 2006-08-29 NaT 2013-08-29 2013-08-292 NaN #or some other value
Another way of doing it, using pd.Series.last_valid_index and pd.DataFrame.count:
>>> df2 = df.copy()
>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
calv1 calv2 calv3 calv4 no_calv
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4.0
2 NaN NaN NaN NaN 0.0
3 2006-08-29 NaN NaN NaN 1.0
4 2006-08-29 2007-08-29 2010-08-29 NaN 3.0
5 2006-08-29 2013-08-29 NaN NaN 2.0
6 2006-08-29 NaN 2013-08-29 2013-08-292 NaN
pd.Series.last_valid_index returns the position of the last valid data point in a Series. Applying it to your rows tells you the column position where the last valid data sits (after which there are only NaNs/NaTs).
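For a quick illustration (toy values, not the question's data): last_valid_index returns the index label of the last non-missing entry, and None when the whole Series is missing, which is why it is paired with .fillna(0) in this answer:
>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([1.0, 2.0, np.nan, np.nan]).last_valid_index()
1
>>> print(pd.Series([np.nan, np.nan]).last_valid_index())
None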
Below I temporarily replaced the column names with integer indices and then applied pd.Series.last_valid_index to each row:
>>> df2.columns = np.arange(df2.shape[1]) + 1
>>> df2
1 2 3 4
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29
2 NaN NaN NaN NaN
3 2006-08-29 NaN NaN NaN
4 2006-08-29 2007-08-29 2010-08-29 NaN
5 2006-08-29 2013-08-29 NaN NaN
6 2006-08-29 NaN 2013-08-29 2013-08-292
>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0)
id
1 4.0
2 0.0
3 1.0
4 3.0
5 2.0
6 4.0
dtype: float64
So on row 1, last valid data is in column 4, on row 2 there is no valid data, and so on.
Now let's count the number of valid values in each row:
>>> df2.count(axis=1)
id
1 4
2 0
3 1
4 3
5 2
6 3
dtype: int64
So on row 1 there are 4 valid values, on row 2 no valid values, and so on. Now, if all NaN/NaT values are towards the end of the row, the counts should match the last valid data positions we calculated above:
>>> df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1)
id
1 True
2 True
3 True
4 True
5 True
6 False
dtype: bool
So as seen, it matches on all rows except the last, because a NaT appears in the middle of valid values in the last row. We can use this as a mask and then fill in the sum:
>>> mask = (df2.apply(pd.Series.last_valid_index, axis=1).fillna(0) == df2.count(axis=1))
>>> df.loc[mask, 'no_calv'] = df.notna().sum(1)
>>> df
calv1 calv2 calv3 calv4 no_calv
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4.0
2 NaN NaN NaN NaN 0.0
3 2006-08-29 NaN NaN NaN 1.0
4 2006-08-29 2007-08-29 2010-08-29 NaN 3.0
5 2006-08-29 2013-08-29 NaN NaN 2.0
6 2006-08-29 NaN 2013-08-29 2013-08-292 NaN
You can try the following, with df.interpolate:
>>> # convert to something other than datetime
>>> numeric = df.apply(lambda col: col.dt.day, axis=1)
>>> numeric
calv1 calv2 calv3 calv4
id
1 29.0 29.0 29.0 29.0
2 NaN NaN NaN NaN
3 29.0 NaN NaN NaN
4 29.0 29.0 29.0 NaN
5 29.0 29.0 NaN NaN
6 29.0 NaN 29.0 29.0
>>> mask = (
...     numeric.isna() != numeric.interpolate(limit_area='inside', axis=1).isna()
... ).any(1)
>>> mask
id
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
>>> df.loc[~mask, 'no_calv'] = df.notna().sum(1)
# Or,
# df['no_calv'] = np.where(mask, np.nan, df.notna().sum(1))
>>> df
calv1 calv2 calv3 calv4 no_calv
id
1 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4.0
2 NaT NaT NaT NaT 0.0
3 2006-08-29 NaT NaT NaT 1.0
4 2006-08-29 2007-08-29 2010-08-29 NaT 3.0
5 2006-08-29 2013-08-29 NaT NaT 2.0
6 2006-08-29 NaT 2013-08-29 2013-08-29 NaN
What interpolate(limit_area='inside') does is fill NaNs only where there are valid values on either side.
For example:
>>> numeric
calv1 calv2 calv3 calv4
id
1 29.0 29.0 29.0 29.0
2 NaN NaN NaN NaN
3 29.0 NaN NaN NaN
4 29.0 29.0 29.0 NaN
5 29.0 29.0 NaN NaN
6 29.0 NaN 29.0 29.0
>>> numeric.interpolate(limit_area='inside', axis=1)
calv1 calv2 calv3 calv4
id
1 29.0 29.0 29.0 29.0
2 NaN NaN NaN NaN
3 29.0 NaN NaN NaN
4 29.0 29.0 29.0 NaN
5 29.0 29.0 NaN NaN
6 29.0 29.0 29.0 29.0
(Only row 6 is filled in, because its missing value sits between valid values.)
So if we compare which NaN values from numeric do not match the interpolated numeric, we can find the rows that have NaN values in between valid values.
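As a minimal sketch on a single toy row (values made up for illustration), the comparison flags exactly the position that interpolate(limit_area='inside') was able to fill, i.e. an interior gap:
>>> row = pd.Series([29.0, np.nan, 29.0, np.nan])
>>> row.isna() != row.interpolate(limit_area='inside').isna()
0    False
1     True
2    False
3    False
dtype: bool
The trailing NaN at position 3 is not filled, so it does not trip the mask; only the interior NaN at position 1 does.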
To test if a value is NaT, use pd.isnull as shown in this answer. isnull matches None, NaN, and NaT.
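A quick check (assuming the usual import pandas as pd and import numpy as np) of why pd.isnull works where equality comparisons do not: NaT, like NaN, is never equal to itself.
>>> pd.isnull(pd.NaT), pd.isnull(np.nan), pd.isnull(None)
(True, True, True)
>>> pd.NaT == pd.NaT
False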
You can build a function which does this check and counts the values until it hits a null value. For example:
import io
import numpy as np
import pandas as pd
df = pd.read_fwf(io.StringIO("""calv1 calv2 calv3 calv4
2006-08-29 2007-08-29 2008-08-29 2009-08-29
NaT NaT NaT NaT
2006-08-29 NaT NaT NaT
2006-08-29 2007-08-29 2010-08-29 NaT
2006-08-29 2013-08-29 NaT NaT
2006-08-29 NaT 2013-08-29 2013-08-292"""))
df = df.replace("NaT", pd.NaT)
def count_non_nat(row):
    count = 0
    for i in row:
        if pd.isnull(i):
            # Hit a null: if we counted fewer values than the row actually
            # contains, a valid value follows the gap, so flag the row
            if count < len(row.dropna()):
                return np.nan
            return count
        count += 1
    return count
# Apply this function row-wise (axis=1)
df['count'] = df.apply(count_non_nat, axis=1)
The output is a new column:
calv1 calv2 calv3 calv4 count
0 2006-08-29 2007-08-29 2008-08-29 2009-08-29 4
1 NaT NaT NaT NaT 0
2 2006-08-29 NaT NaT NaT 1
3 2006-08-29 2007-08-29 2010-08-29 NaT 3
4 2006-08-29 2013-08-29 NaT NaT 2
5 2006-08-29 NaT 2013-08-29 2013-08-292 NaN