
Pandas - Generate Unique ID based on row values

I would like to generate an integer-based unique ID for users (in my df).

Let's say I have:

index  first  last    dob
0      peter  jones   20000101
1      john   doe     19870105
2      adam   smith   19441212
3      john   doe     19870105
4      jenny  fast    19640822

I would like to generate an ID column like so:

index  first  last    dob       id
0      peter  jones   20000101  1244821450
1      john   doe     19870105  1742118427
2      adam   smith   19441212  1841181386
3      john   doe     19870105  1742118427
4      jenny  fast    19640822  1687411973

The ID should be 10 digits long and derived from the values of the fields, so rows with identical values (the two john doe rows) get the same ID.

I've looked into hashing, encryption and UUIDs, but can't find much related to this specific non-security use case. It's just about generating an internal identifier.

  • I can't use groupby/cat code type methods, because the order of the rows may change.
  • The dataset won't grow beyond 50k rows.
  • It's safe to assume that two different users won't share the same first, last and dob.

Feel like I may be tackling this the wrong way as I can't find much literature on it!

Thanks

asked Feb 25 '20 11:02 by swifty




1 Answer

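You can try using the built-in hash function on the concatenated fields:

df['id'] = (df['first'] + df['last'] + df['dob'].astype(str)).map(hash)

Please note that the hash is usually longer than 10 digits and can be negative. Identical rows get the same value within one Python session, but hash() of a string is salted per process unless PYTHONHASHSEED is set, so the IDs change between runs. Uniqueness is very likely but not guaranteed.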
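If the ID has to be the same on every run and exactly 10 digits, one possible approach is to hash the fields with hashlib and fold the digest into the 10-digit range. This is only a sketch: the helper name stable_id, the "|" separator and the choice of SHA-256 are my own assumptions, not something from the question.

```python
import hashlib

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})


def stable_id(first, last, dob, digits=10):
    """Hypothetical helper: map the fields to an integer with exactly `digits` digits."""
    # The "|" separator (an assumption) keeps ("ab", "c") distinct from ("a", "bc").
    key = f"{first}|{last}|{dob}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer.
    n = int.from_bytes(digest[:8], "big")
    # Fold into [10**(digits-1), 10**digits) so there is never a leading zero.
    low = 10 ** (digits - 1)
    return low + n % (9 * low)


df["id"] = [
    stable_id(f, l, d)
    for f, l, d in zip(df["first"], df["last"], df["dob"])
]

# Sanity check: distinct (first, last, dob) rows should map to distinct IDs.
distinct = df.drop_duplicates(subset=["first", "last", "dob"])
print(distinct["id"].is_unique)
```

The IDs depend only on the row values, so they survive reordering and restarts. Squeezing a hash into 10 digits does make collisions possible, though: a rough birthday-bound estimate for 50k distinct users gives on the order of a 10% chance of at least one clash, so the uniqueness check at the end is worth keeping (or allow more digits if you can).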

answered Sep 18 '22 01:09 by Mahendra Singh