Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare the previous N rows to the current row in a pandas column

I am trying to populate a new column in my pandas dataframe by considering the values of the previous n rows. If the current value is not equal to any of the past n values in that column, it should populate "N", else "Y".

Please let me know what would be a good way to achieve this.

Here's my input data :

testdata = {'col1' :['car','car','car','bus','bus','bus','car']}
df = pd.DataFrame.from_dict(testdata)

Input DF:

  col1
0  car
1  car
2  car
3  bus
4  bus
5  car  
6  car

Output DF (with n=2):

  col1   Result
0  car         
1  car         
2  car      Y  
3  bus      N  
4  bus      Y  
5  bus      Y  
6  car      N 
like image 872
tipsydeepz Avatar asked Jun 13 '19 03:06

tipsydeepz


People also ask

How do I compare row values in Pandas?

You can use the DataFrame. diff() function to find the difference between two rows in a pandas DataFrame. where: periods: The number of previous rows for calculating the difference.

What is the difference between ILOC () and LOC ()?

When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.

How to find the difference between two rows in a pandas Dataframe?

You can use the DataFrame.diff () function to find the difference between two rows in a pandas DataFrame. periods: The number of previous rows for calculating the difference. axis: Find difference over rows (0) or columns (1). The following examples show how to use this function in practice.

How do you subtract between two rows in pandas?

Because of this, we can easily use the shift method to subtract between rows. The Pandas shift method offers a pre-step to calculating the difference between two rows by letting you see the data directly. The Pandas diff method simply calculates the difference, thereby abstracting the calculation.

Can two DataFrames have the same column names but different rows?

As you can see, both data sets contain the same column names, but partly different rows. In this example, I’ll show how to compare two pandas DataFrames with different lengths and identify all different rows that are contained in only one of the two DataFrames.

How do you compare values between two rows and columns?

Determine which axis to align the comparison on. with rows drawn alternately from self and other. with columns drawn alternately from self and other. If true, all rows and columns are kept. Otherwise, only the ones with different values are kept. If true, the result keeps values that are equal.


2 Answers

Here is my way

n=2
l=[False]*n+[df.iloc[x,0] in df.iloc[x-n:x,0].tolist() for x in np.arange(n,len(df))]
df['New']=l
df
  col1    New
0  car  False
1  car  False
2  car   True
3  bus  False
4  bus   True
5  bus   True
6  car  False
like image 113
BENY Avatar answered Oct 23 '22 07:10

BENY


You can do this with a Rolling.apply call.

n = 2
res = (df['col1'].astype('category')
                 .cat.codes
                 .rolling(n+1)
                 .apply(lambda x: x[-1] in x[:-1], raw=True))

df['Result'] = np.where(res == 1, 'Y', 'N')
df

  col1 Result
0  car      N
1  car      N
2  car      Y
3  bus      N
4  bus      Y
5  bus      Y
6  car      N

Rolling only works with numeric data, so the initial step is to factorise it. This can be done in many ways, I've used astype('category') and then extracted the codes.


Another option is using pd.Categorical for the conversion,

res = (df.assign(col1=pd.Categorical(df['col1']).codes)['col1']
         .rolling(n+1)
         .apply(lambda x: x[-1] in x[:-1], raw=True))

df['Result'] = res.map({1: 'Y', 0: 'N'})
df

  col1 Result
0  car    NaN
1  car    NaN
2  car      Y
3  bus      N
4  bus      Y
5  bus      Y
6  car      N
like image 45
cs95 Avatar answered Oct 23 '22 05:10

cs95