How do I get the unique values of a column of lists in pandas or numpy, such that the second column
would result in 'action', 'crime', 'drama'?
The closest (but non-functional) solutions I could come up with were:
genres = data['Genre'].unique()
But this predictably results in a TypeError saying that lists aren't hashable:
TypeError: unhashable type: 'list'
A set seemed like a good idea, but
genres = data.apply(set(), columns=['Genre'], axis=1)
also results in a
TypeError: set() takes no keyword arguments
You can use explode:
import pandas as pd

data = pd.DataFrame([
{
"title": "The Godfather: Part II",
"genres": ["crime", "drama"],
"director": "Francis Ford Coppola"
},
{
"title": "The Dark Knight",
"genres": ["action", "crime", "drama"],
"director": "Christopher Nolan"
}
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique()
Results in:
array(['crime', 'drama', 'action'], dtype=object)
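Note that unique() keeps the order of first appearance, which is why 'action' comes last here. If you want the sorted order from the question, a minimal sketch could look like the following. It assumes pandas 0.25 or later (where Series.explode is available), and `unique_list_values` is a hypothetical helper name, not part of pandas:

```python
import pandas as pd


def unique_list_values(series):
    """Return the sorted unique items of a Series whose cells are lists.

    Hypothetical helper for illustration; not a pandas function.
    """
    # explode() gives every list item its own row; dropna() discards the
    # NaN rows that empty lists or missing cells turn into.
    exploded = series.explode().dropna()
    return sorted(exploded.unique())


data = pd.DataFrame({
    "title": ["The Godfather: Part II", "The Dark Knight"],
    "genres": [["crime", "drama"], ["action", "crime", "drama"]],
})

genres = unique_list_values(data["genres"])
print(genres)
```

Sorting happens in plain Python here, so the result is a list rather than a NumPy array.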
If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists:
>>> import itertools
>>> import numpy as np
>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')
Or, even faster:
>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
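The snippets above assume a DataFrame named df with a Genre column. As a self-contained sketch, using the same two-row sample frame as the timings, it might look like this (the sorted() call is my addition, since a set has no guaranteed order):

```python
import itertools

import pandas as pd

# Sample frame: each cell of "Genre" holds a list of strings.
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})

# chain.from_iterable lazily walks every item of every list once;
# set() removes the duplicates and sorted() fixes the output order.
unique_genres = sorted(set(itertools.chain.from_iterable(df["Genre"])))
print(unique_genres)
```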
Timings
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)
%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop
%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop
%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop
%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop
%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
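One plausible reason the sum() variants are so much slower: adding two lists builds a new list every time, so repeated concatenation copies the growing result over and over, while chain.from_iterable visits each item once. The sketch below illustrates that effect in plain Python, using the built-in sum as a stand-in (an assumption on my part; it does not call Series.sum itself). The row count is arbitrary and the absolute numbers will differ by machine:

```python
import itertools
import timeit

# 4000 rows of genre lists, mirroring the repeated sample data above.
rows = [["crime", "drama"], ["action", "crime", "drama"]] * 2000


def with_chain():
    # Visits every item exactly once.
    return set(itertools.chain.from_iterable(rows))


def with_concat():
    # sum(rows, []) concatenates list by list, copying the
    # accumulated result at every step.
    return set(sum(rows, []))


result_chain = with_chain()
result_concat = with_concat()

for fn in (with_chain, with_concat):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Both functions return the same set; only the amount of copying differs.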