Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

create unique identifier in dataframe based on combination of columns

Tags:

python

pandas

I have the following dataframe:

    id  Lat         Lon         Year    Area    State
50319   -36.0629    -62.3423    2019    90  Iowa
18873   -36.0629    -62.3423    2017    90  Iowa
18876   -36.0754    -62.327     2017    124 Illinois
18878   -36.0688    -62.3353    2017    138 Kansas

I want to create a new column which assigns a unique identifier based on whether the columns Lat, Lon and Area have the same values. E.g. in this case rows 1 and 2 have the same values in those columns and will be given the same unique identifier 0_Iowa where Iowa comes from the State column. I tried using a for loop but is there a more pythonic way to do it?

id       Lat         Lon       Year    Area State   unique_id
50319   -36.0629    -62.3423    2019    90  Iowa    0_Iowa
18873   -36.0629    -62.3423    2017    90  Iowa    0_Iowa
18876   -36.0754    -62.327     2017    124 Illinois    1_Illinois
18878   -36.0688    -62.3353    2017    138 Kansas  2_Kansas
like image 260
user308827 Avatar asked Jan 25 '23 23:01

user308827


1 Answers

I'd go with groupby.ngroup setting sort=False for the grouping and str.cat to concatenate with State setting a separator:

df['Sate'] = (df.groupby(['Lat','Lon','Area'], sort=False)
                .ngroup() 
                .astype(str)
                .str.cat(df.State, sep='_'))

print(df)

      id      Lat      Lon  Year  Area     State        Sate
0  50319 -36.0629 -62.3423  2019    90      Iowa      0_Iowa
1  18873 -36.0629 -62.3423  2017    90      Iowa      0_Iowa
2  18876 -36.0754 -62.3270  2017   124  Illinois  1_Illinois
3  18878 -36.0688 -62.3353  2017   138    Kansas    2_Kansas
1
​
like image 193
yatu Avatar answered Jan 28 '23 12:01

yatu