Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: how to get the unique values of a column that contains a list of values?

Tags:

python

pandas

Consider the following dataframe

df = pd.DataFrame({'name' : [['one two','three four'], ['one'],[], [],['one two'],['three']],
                   'col' : ['A','B','A','B','A','B']})       
df.sort_values(by='col',inplace=True)

df
Out[62]: 
  col                   name
0   A  [one two, three four]
2   A                     []
4   A              [one two]
1   B                  [one]
3   B                     []
5   B                [three]

I would like to get a column that keeps track of all the unique strings included in name for each combination of col.

That is, the expected output is

df
Out[62]: 
  col                   name    unique_list
0   A  [one two, three four]    [one two, three four]
2   A                     []    [one two, three four]
4   A              [one two]    [one two, three four]
1   B                  [one]    [one, three]
3   B                     []    [one, three]
5   B                [three]    [one, three]

Indeed, say for group A, you can see that the unique set of strings included in [one two, three four], [] and [one two] is [one two]

I can obtain the corresponding number of unique values using Pandas : how to get the unique number of values in cells when cells contain lists? :

df['count_unique']=df.groupby('col')['name'].transform(lambda x: list(pd.Series(x.apply(pd.Series).stack().reset_index(drop=True, level=1).nunique())))


df
Out[65]: 
  col                   name count_unique
0   A  [one two, three four]            2
2   A                     []            2
4   A              [one two]            2
1   B                  [one]            2
3   B                     []            2
5   B                [three]            2

but replacing nunique with unique above fails.

Any ideas? Thanks!

like image 354
ℕʘʘḆḽḘ Avatar asked Sep 14 '16 21:09

ℕʘʘḆḽḘ


2 Answers

Here is the solution

df['unique_list'] = df.col.map(df.groupby('col')['name'].sum().apply(np.unique))
    df

enter image description here

like image 155
piRSquared Avatar answered Nov 15 '22 17:11

piRSquared


Try:

uniq_df = df.groupby('col')['name'].apply(lambda x: list(set(reduce(lambda y,z: y+z,x)))).reset_index()
uniq_df.columns = ['col','uniq_list']
pd.merge(df,uniq_df, on='col', how='left')

Desired output:

  col                   name              uniq_list
0   A  [one two, three four]  [one two, three four]
1   A                     []  [one two, three four]
2   A              [one two]  [one two, three four]
3   B                  [one]           [three, one]
4   B                     []           [three, one]
5   B                [three]           [three, one]
like image 33
Abdou Avatar answered Nov 15 '22 17:11

Abdou