Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The uuid's should be generated from the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame you should use pandas
function transform
.
Using your sample data frame, we'll add a column in order to have a series to apply transform
to. Since uuid.uuid4
doesn't take any argument it really doesn't matter what the column is.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform
:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4
will be called as many times as there are distinct groups
How about this
names = df['Name'].unique()
for name in names:
df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
could shorten it to
for name in df['Name'].unique():
df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With