Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Tags:

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:

      FirstName LastName  id 0     Tom       Jones     1 1     Tom       Jones     1 2     David     Smith     1 3     Alex      Thompson  1 4     Alex      Thompson  1

So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.

So I already have one solution, which is a dead simple python loop iterating two values (One for id, one for index) and assigning the individual an id based on whether they match the previous individual:

x = 1 i = 1  while i < len(df_test):     if (df_test.LastName[i] == df_test.LastName[i-1]) &      (df_test.FirstName[i] == df_test.FirstName[i-1]):         df_test.loc[i, 'id'] = x         i = i+1     else:         x = x+1         df_test.loc[i, 'id'] = x         i = i+1

The problem I'm running into is that the DataFrame has about 9 million entries, so with that loop it would have taken a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right solution yet.

998

asked Aug 15 '17 01:08

Simon Sharp

1 Answers

This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

df['id'] = df.groupby(['LastName','FirstName']).ngroup() >>> df     First    Second  id 0    Tom     Jones   0 1    Tom     Jones   0 2  David     Smith   1 3   Alex  Thompson   2 4   Alex  Thompson   2

I checked timings and, for the small dataset in this example, Alexander's answer is faster:

%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes) 1000 loops, best of 3: 848 µs per loop  %timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup()) 1000 loops, best of 3: 1.22 ms per loop

However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

import faker fakenames = faker.Faker() first = [ fakenames.first_name() for _ in range(5000) ] last = [ fakenames.last_name() for _ in range(5000) ] df2 = pd.DataFrame({'FirstName':first, 'LastName':last}) df2 = pd.concat([df2, df2.iloc[:2000]])

Running the timing on this larger data set gives:

%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes) 100 loops, best of 3: 5.22 ms per loop  %timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup()) 100 loops, best of 3: 3.1 ms per loop

You may want to test both approaches on your data set to determine which one works best given the size of your data.

answered Sep 30 '22 21:09

Craig

Related questions
                            
                                Detecting a US Holiday
                            
                                How to determine what user and group a Python script is running as?
                            
                                Best way to do a "not None" test in Python for a normal and Unicode empty string?
                            
                                Merging sublists in python [duplicate]
                            
                                OverflowError: (34, 'Result too large')
                            
                                python bind socket.error: [Errno 13] Permission denied
                            
                                How to mock psycopg2 cursor object?
                            
                                combining two string variables [duplicate]
                            
                                What's causing 'unable to connect to data source' for pyodbc?
                            
                                python regex: get end digits from a string
                            
                                Output to the same line overwriting previous
                            
                                Convert Python sequence to NumPy array, filling missing values
                            
                                Ignoring an error message to continue with the loop in python [duplicate]
                            
                                How does Python return multiple values from a function?
                            
                                Suggestions to plot overlapping lines in matplotlib?
                            
                                Why is python 3.6.1. not available in pyenv?
                            
                                python sqlalchemy label usage
                            
                                Know the depth of a dictionary
                            
                                ValueError: feature_names mismatch: in xgboost in the predict() function
                            
                                pandas pivot table to data frame [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Tags:

python

indexing

pandas

dataframe

Simon Sharp

People also ask

1 Answers

Craig

Recent Activity

Donate For Us