Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:

      FirstName LastName  id 0     Tom       Jones     1 1     Tom       Jones     1 2     David     Smith     1 3     Alex      Thompson  1 4     Alex      Thompson  1 

So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.

So I already have one solution, which is a dead simple python loop iterating two values (One for id, one for index) and assigning the individual an id based on whether they match the previous individual:

x = 1 i = 1  while i < len(df_test):     if (df_test.LastName[i] == df_test.LastName[i-1]) &      (df_test.FirstName[i] == df_test.FirstName[i-1]):         df_test.loc[i, 'id'] = x         i = i+1     else:         x = x+1         df_test.loc[i, 'id'] = x         i = i+1 

The problem I'm running into is that the DataFrame has about 9 million entries, so with that loop it would have taken a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right solution yet.

like image 998
Simon Sharp Avatar asked Aug 15 '17 01:08

Simon Sharp


People also ask

How extract unique values from multiple columns in Pandas?

Pandas series aka columns has a unique() method that filters out only unique values from a column. The first output shows only unique FirstNames. We can extend this method using pandas concat() method and concat all the desired columns into 1 single column and then find the unique of the resultant column.

How do you get unique names on Pandas?

2. pandas Get Unique Values in Column. Unique is also referred to as distinct, you can get unique values in the column using pandas Series. unique() function, since this function needs to call on the Series object, use df['column_name'] to get the unique values as a Series.

How do you create a unique ID in Python?

uuid1() is defined in UUID library and helps to generate the random id using MAC address and time component. bytes : Returns id in form of 16 byte string. int : Returns id in form of 128-bit integer. hex : Returns random id as 32 character hexadecimal string.


1 Answers

This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

df['id'] = df.groupby(['LastName','FirstName']).ngroup() >>> df     First    Second  id 0    Tom     Jones   0 1    Tom     Jones   0 2  David     Smith   1 3   Alex  Thompson   2 4   Alex  Thompson   2 

I checked timings and, for the small dataset in this example, Alexander's answer is faster:

%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes) 1000 loops, best of 3: 848 µs per loop  %timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup()) 1000 loops, best of 3: 1.22 ms per loop 

However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

import faker fakenames = faker.Faker() first = [ fakenames.first_name() for _ in range(5000) ] last = [ fakenames.last_name() for _ in range(5000) ] df2 = pd.DataFrame({'FirstName':first, 'LastName':last}) df2 = pd.concat([df2, df2.iloc[:2000]]) 

Running the timing on this larger data set gives:

%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes) 100 loops, best of 3: 5.22 ms per loop  %timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup()) 100 loops, best of 3: 3.1 ms per loop 

You may want to test both approaches on your data set to determine which one works best given the size of your data.

like image 72
Craig Avatar answered Sep 30 '22 21:09

Craig