Pandas - Generate Unique ID based on row values

Tags:

python

pandas

hash

I would like to generate an integer-based unique ID for users (in my df).

Let's say I have:

index  first  last    dob
0      peter  jones   20000101
1      john   doe     19870105
2      adam   smith   19441212
3      john   doe     19870105
4      jenny  fast    19640822

I would like to generate an ID column like so:

index  first  last    dob       id
0      peter  jones   20000101  1244821450
1      john   doe     19870105  1742118427
2      adam   smith   19441212  1841181386
3      john   doe     19870105  1742118427
4      jenny  fast    19640822  1687411973

The ID should have 10 digits and be derived from the values of the fields, so rows with identical values (the two john doe rows) get the same ID.

I've looked into hashing, encryption and UUIDs but can't find much related to this specific non-security use case. It's just about generating an internal identifier.

I can't use groupby/category-code methods, because the IDs would change if the order of the rows changes.
The dataset won't grow beyond 50k rows.
It's safe to assume that no two different people share the same first, last and dob.

Feel like I may be tackling this the wrong way as I can't find much literature on it!

Thanks

asked Feb 25 '20 11:02

swifty

1 Answer

You can try using Python's built-in hash function.

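# Cast all key columns (including the integer dob) to str, join them per row, then hash.
df['id'] = df[['first', 'last', 'dob']].astype(str).agg('|'.join, axis=1).map(hash)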

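Please note that the resulting hash is an integer (possibly negative) that is usually longer than 10 digits. Identical rows get the same value within one Python session, but string hashes are randomized per process unless PYTHONHASHSEED is set, so the IDs are not guaranteed to be the same across runs.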
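
If the IDs need to stay the same across runs and be exactly 10 digits long, one option is to use hashlib instead of hash(). The following is only a minimal sketch and not part of the original answer: the helper name stable_id, the "|" separator and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib

import pandas as pd


def stable_id(*values, digits=10):
    """Map the given field values to a deterministic integer with `digits` digits.

    Hypothetical helper, named here for illustration only.
    """
    # Join with a separator so ("ab", "c") and ("a", "bc") produce different keys.
    key = "|".join(str(v) for v in values).encode("utf-8")
    # A hashlib digest does not depend on the interpreter session, unlike hash().
    number = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Fold into [10**(digits-1), 10**digits) so the result never has a leading zero.
    low = 10 ** (digits - 1)
    return low + number % (9 * low)


df = pd.DataFrame(
    {
        "first": ["peter", "john", "adam", "john", "jenny"],
        "last": ["jones", "doe", "smith", "doe", "fast"],
        "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
    }
)

key_cols = ["first", "last", "dob"]
df["id"] = [stable_id(*row) for row in df[key_cols].itertuples(index=False, name=None)]

# Different people can still collide in a 10-digit space, so check for it.
n_people = len(df.drop_duplicates(key_cols))
assert df["id"].nunique() == n_people, "ID collision: widen `digits` or change the key"
```

Because the ID depends only on the row's values, reordering the rows does not change it. Keep in mind that a 10-digit space is small: treating the IDs as uniformly random, the birthday approximation gives roughly a 13% chance of at least one collision among 50k distinct people, so the uniqueness check at the end is worth keeping (or use more digits if the format allows it).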

answered Sep 18 '22 01:09

Mahendra Singh
