I am trying to populate a new column in my pandas dataframe by considering the values of the previous n rows. If the current value is not equal to any of the past n values in that column, it should populate "N", else "Y".
Please let me know what would be a good way to achieve this.
Here's my input data :
testdata = {'col1' :['car','car','car','bus','bus','bus','car']}
df = pd.DataFrame.from_dict(testdata)
Input DF:
col1
0 car
1 car
2 car
3 bus
4 bus
5 car
6 car
Output DF (with n=2):
col1 Result
0 car
1 car
2 car Y
3 bus N
4 bus Y
5 bus Y
6 car N
You can use the DataFrame. diff() function to find the difference between two rows in a pandas DataFrame. where: periods: The number of previous rows for calculating the difference.
When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.
You can use the DataFrame.diff () function to find the difference between two rows in a pandas DataFrame. periods: The number of previous rows for calculating the difference. axis: Find difference over rows (0) or columns (1). The following examples show how to use this function in practice.
Because of this, we can easily use the shift method to subtract between rows. The Pandas shift method offers a pre-step to calculating the difference between two rows by letting you see the data directly. The Pandas diff method simply calculates the difference, thereby abstracting the calculation.
As you can see, both data sets contain the same column names, but partly different rows. In this example, I’ll show how to compare two pandas DataFrames with different lengths and identify all different rows that are contained in only one of the two DataFrames.
Determine which axis to align the comparison on. with rows drawn alternately from self and other. with columns drawn alternately from self and other. If true, all rows and columns are kept. Otherwise, only the ones with different values are kept. If true, the result keeps values that are equal.
Here is my way
n=2
l=[False]*n+[df.iloc[x,0] in df.iloc[x-n:x,0].tolist() for x in np.arange(n,len(df))]
df['New']=l
df
col1 New
0 car False
1 car False
2 car True
3 bus False
4 bus True
5 bus True
6 car False
You can do this with a Rolling.apply
call.
n = 2
res = (df['col1'].astype('category')
.cat.codes
.rolling(n+1)
.apply(lambda x: x[-1] in x[:-1], raw=True))
df['Result'] = np.where(res == 1, 'Y', 'N')
df
col1 Result
0 car N
1 car N
2 car Y
3 bus N
4 bus Y
5 bus Y
6 car N
Rolling only works with numeric data, so the initial step is to factorise it. This can be done in many ways, I've used astype('category')
and then extracted the codes.
Another option is using pd.Categorical
for the conversion,
res = (df.assign(col1=pd.Categorical(df['col1']).codes)['col1']
.rolling(n+1)
.apply(lambda x: x[-1] in x[:-1], raw=True))
df['Result'] = res.map({1: 'Y', 0: 'N'})
df
col1 Result
0 car NaN
1 car NaN
2 car Y
3 bus N
4 bus Y
5 bus Y
6 car N
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With