Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Flag repeating entries in pandas time series

I have a data frame that takes this form (but is several millions of rows long):

import pandas as pd     
dict = {'id':["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"], 
    'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
    'vacation':["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
    'new':[1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]} 
df = pd.DataFrame(dict)

A   2000    France
A   2001    Morocco
A   2002    Morocco
B   2000    Germany
B   2001    Germany
B   2003    Germany
C   1999    Japan
C   2000    Australia
C   2001    Japan
D   2000    Canada       
D   2000    Mexico       
D   2001    China        

For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year. I would like to flag the rows when a participant goes to holiday to a destination to which they had not gone the year before (i.e., the destination is new). In the case above, the output would be:

id  year    vacation    new
A   2000    France       1
A   2001    Morocco      1
A   2002    Morocco      0
B   2001    Germany      1
B   2002    Germany      0
B   2003    Germany      0
C   1999    Japan        1
C   1999    Australia    1
C   2000    Japan        1
D   2000    Canada       1
D   2000    Mexico       1
D   2001    China        1

For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany 3 times in a row, the 2nd and 3rd occurrences are not flagged. When person C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit to 2 destinations in 2000) and all of them are flagged.

I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.

I'd appreciate any input; thanks.

like image 428
JuM24 Avatar asked Jan 25 '23 10:01

JuM24


2 Answers

IIUC,

what we are doing is grouping by id & vacation and ensuring that year is not equal to the year above, or we can selecting the first instance of that combination.

hopefully that's clear. let me know if you need anymore help.

df["new_2"] = (
    df.groupby(["id", "vacation"])["id", "year"]
    .apply(lambda x: x.ne(x.shift()))
    .all(axis=1)
    .add(0)
)

print(df)
  id  year   vacation  new_2
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
like image 148
Umar.H Avatar answered Feb 09 '23 01:02

Umar.H


Here's one solution I came up with, using groupby and transform:

df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])
    .transform(lambda x: x.iloc[0])
    .year.eq(df.year)
    .astype(int)
)

You'll get

  id  year   vacation  new
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
like image 30
ignoring_gravity Avatar answered Feb 09 '23 02:02

ignoring_gravity