I have the following sample DataFrame df, consisting of two columns, 'col1' and 'col2'. I would like to find the list of unique names across the whole DataFrame.
import pandas as pd

d = {'col1':['Pat, Joseph',
'Tony, Hoffman',
'Miriam, Goodwin',
'Roxanne, Padilla',
'Julie, Davis',
'Muriel, Howell',
'Salvador, Reese',
'Kristopher, Mckenzie',
'Lucille, Thornton',
'Brenda, Wilkerson'],
'col2':['Kristopher, Mckenzie',
'Lucille, Thornton',
'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
'Muriel, Howell', 'Harriet, Phillips',
'Belinda, Drake;David, Ford', 'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}
df = pd.DataFrame(data=d)
For column col1 I can get this done with the unique() function.
df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin',
'Roxanne, Padilla', 'Julie, Davis', 'Muriel, Howell',
'Salvador, Reese', 'Kristopher, Mckenzie', 'Lucille, Thornton',
'Brenda, Wilkerson'], dtype=object)
len(df.col1)
10 # total number of rows
len(df.col1.unique())
10 # total number of unique values (all ten sample names in col1 are distinct)
For col2, some of the rows have multiple names separated by a semicolon, e.g. 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'.
How can I get the unique names from col2 using a vectorized operation? I am trying to avoid a for loop since the actual data set is large.
First split by ;\s* (a regex: a semicolon followed by zero or more whitespace characters) with expand=True to get a DataFrame, then reshape with stack into a Series, and finally use unique:
print(df['col2'].str.split(r';\s*', expand=True).stack().unique())
['Kristopher, Mckenzie' 'Lucille, Thornton' 'Pete, Fitzgerald'
'Cecelia, Bass' 'Julie, Davis' 'Muriel, Howell' 'Harriet, Phillips'
'Belinda, Drake' 'David, Ford' 'Jared, Cummings' 'Joanna, Burns'
'Bob, Cunningham' 'Keith, Hernandez' 'Pat, Joseph']
Detail:
print(df['col2'].str.split(r';\s*', expand=True))
                      0              1                2
0  Kristopher, Mckenzie           None             None
1     Lucille, Thornton           None             None
2      Pete, Fitzgerald  Cecelia, Bass     Julie, Davis
3        Muriel, Howell           None             None
4     Harriet, Phillips           None             None
5        Belinda, Drake    David, Ford             None
6       Jared, Cummings  Joanna, Burns  Bob, Cunningham
7      Keith, Hernandez    Pat, Joseph             None
8  Kristopher, Mckenzie           None             None
9     Lucille, Thornton           None             None
print(df['col2'].str.split(r';\s*', expand=True).stack())
0  0    Kristopher, Mckenzie
1  0       Lucille, Thornton
2  0        Pete, Fitzgerald
   1           Cecelia, Bass
   2            Julie, Davis
3  0          Muriel, Howell
4  0       Harriet, Phillips
5  0          Belinda, Drake
   1             David, Ford
6  0         Jared, Cummings
   1           Joanna, Burns
   2         Bob, Cunningham
7  0        Keith, Hernandez
   1             Pat, Joseph
8  0    Kristopher, Mckenzie
9  0       Lucille, Thornton
dtype: object
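The split-then-reshape idea can also be written without building the intermediate wide DataFrame. The following is only a sketch, not part of the original answer: it assumes a pandas version that provides Series.explode (0.25 or later), and the short col2 Series is a stand-in built from a few of the sample values.

```python
import pandas as pd

# Stand-in for df['col2'], using a few values from the sample data.
col2 = pd.Series([
    'Kristopher, Mckenzie',
    'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
    'Belinda, Drake;David, Ford',
    'Kristopher, Mckenzie',
])

# Split each cell into a list of names (no expand=True here),
# then give every list element its own row.
names = col2.str.split(r';\s*').explode()

# unique() keeps the order of first appearance.
unique_names = names.unique()
print(unique_names)
```

Because no padded None columns are created, this may use less memory when a few rows hold many more names than the rest.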
Alternative solution:
import numpy as np

print(np.unique(np.concatenate(df['col2'].str.split(r';\s*').values)))
['Belinda, Drake' 'Bob, Cunningham' 'Cecelia, Bass' 'David, Ford'
'Harriet, Phillips' 'Jared, Cummings' 'Joanna, Burns' 'Julie, Davis'
'Keith, Hernandez' 'Kristopher, Mckenzie' 'Lucille, Thornton'
'Muriel, Howell' 'Pat, Joseph' 'Pete, Fitzgerald']
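One difference between the two approaches is the order of the result: np.unique returns sorted values, while the pandas unique method keeps the order of first appearance. A small sketch of that difference follows. The two-row col2 Series is made up from the sample names, and the dropna() call is an added precaution, since a missing cell would stay a non-list value after the split and np.concatenate would reject it.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df['col2'].
col2 = pd.Series([
    'Pete, Fitzgerald; Cecelia, Bass',
    'Belinda, Drake;Pete, Fitzgerald',
])

# Series of lists; dropna() is a precaution against missing cells.
parts = col2.str.split(r';\s*').dropna()

# Flatten the lists into one array of names.
flat = np.concatenate(parts.values)

sorted_unique = np.unique(flat)            # sorted alphabetically
ordered_unique = pd.Series(flat).unique()  # order of first appearance

print(sorted_unique)
print(ordered_unique)
```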
EDIT:
For all unique names, first stack all columns into a single Series:
print(df.stack().str.split(r';\s*', expand=True).stack().unique())
['Pat, Joseph' 'Kristopher, Mckenzie' 'Tony, Hoffman' 'Lucille, Thornton'
'Miriam, Goodwin' 'Pete, Fitzgerald' 'Cecelia, Bass' 'Julie, Davis'
'Roxanne, Padilla' 'Muriel, Howell' 'Harriet, Phillips' 'Belinda, Drake'
'David, Ford' 'Salvador, Reese' 'Jared, Cummings' 'Joanna, Burns'
'Bob, Cunningham' 'Keith, Hernandez' 'Brenda, Wilkerson']
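If this is needed in more than one place, the whole chain can be wrapped in a small function. This is just a sketch: unique_names and its sep_pattern parameter are hypothetical names, the two-row frame is a cut-down version of the sample data, and the dropna() and str.strip() calls are extra safeguards that the one-liner above does not have.

```python
import pandas as pd

def unique_names(frame, sep_pattern=r';\s*'):
    """Hypothetical helper: unique names across all columns of `frame`."""
    # All cells of all columns as one Series.
    cells = frame.stack()
    # One column per name, then back to a single Series of names.
    names = cells.str.split(sep_pattern, expand=True).stack()
    # dropna() guards against padding values surviving stack();
    # strip() removes stray whitespace around a name.
    return names.dropna().str.strip().unique()

# Cut-down version of the sample data.
small_df = pd.DataFrame({
    'col1': ['Pat, Joseph', 'Julie, Davis'],
    'col2': ['Keith, Hernandez;Pat, Joseph', 'Pete, Fitzgerald; Julie, Davis'],
})

result = unique_names(small_df)
print(result)
```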