Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add UUID's to pandas DF

Say I have a pandas DataFrame like so:

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
        Name
    0   John Doe
    1   Jane Smith
    2   John Doe
    3   Jane Smith
    4   Jack Dawson
    5   John Doe

And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:

df:
            Name        UUID
        0   John Doe    6d07cb5f-7faa-4893-9bad-d85d3c192f52
        1   Jane Smith  a709bd1a-5f98-4d29-81a8-09de6e675b56
        2   John Doe    6d07cb5f-7faa-4893-9bad-d85d3c192f52
        3   Jane Smith  a709bd1a-5f98-4d29-81a8-09de6e675b56
        4   Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
        5   John Doe    6d07cb5f-7faa-4893-9bad-d85d3c192f52

The uuid's should be generated from the uuid.uuid4() function.

My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.

While that would work, I'm wondering if there's a more efficient way to do this?

like image 583
unpairestgood Avatar asked Aug 04 '16 16:08

unpairestgood


2 Answers

Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame you should use pandas function transform.

Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument it really doesn't matter what the column is.

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1

Now to use transform:

import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())

+----+--------------+--------------------------------------+
|    |    Name      |                 UUID                 |
+----+--------------+--------------------------------------+
| 0  | John Doe     | c032c629-b565-4903-be5c-81bf05804717 |
| 1  | Jane Smith   | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2  | John Doe     | c032c629-b565-4903-be5c-81bf05804717 |
| 3  | Jane Smith   | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4  | Jack Dawson  | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5  | John Doe     | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+

uuid.uuid4 will be called as many times as there are distinct groups

like image 131
MaFF Avatar answered Oct 16 '22 08:10

MaFF


How about this

names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()

could shorten it to

for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
like image 24
SO44 Avatar answered Oct 16 '22 09:10

SO44