series.unique vs list of set - performance

I have a pandas DataFrame with multiple columns. The aim is to find the unique values in one of those columns.

Two ways to achieve this are:

  • Build a set of that series and convert it to a list: list(set(data['Day']))

  • Use pandas' built-in method: data['Day'].unique()

In my trials, the set approach works faster. Is this true in most cases? Why or why not? Are there any other resource-utilization implications?

Please also explain why one works better than the other.
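A minimal way to reproduce the comparison (the 'Day' column below is hypothetical stand-in data; swap in your own DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 'Day' column with many repeats of a few values
data = pd.DataFrame({'Day': np.random.choice(['Mon', 'Tue', 'Wed'], 100_000)})

# Approach 1: Python set, converted to a list
uniq_set = list(set(data['Day']))

# Approach 2: pandas' built-in unique()
uniq_pd = data['Day'].unique()

# Both yield the same values, though their ordering may differ
assert sorted(uniq_set) == sorted(uniq_pd)
```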

Vikram Tiwari asked Oct 19 '17 21:10
2 Answers

It will depend on the data type. For numeric types, pd.unique should be significantly faster.

For strings, which are stored as Python objects, the difference is much smaller, and set() will usually be competitive, since it is doing a very similar thing.

Some examples:

strs = np.repeat(np.array(['a', 'b', 'c'], dtype='O'), 10000)

In [11]: %timeit pd.unique(strs)
558 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [12]: %timeit list(set(strs))
531 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

nums = np.repeat(np.array([1, 2, 3]), 10000)

In [13]: %timeit pd.unique(nums)
230 µs ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [14]: %timeit list(set(nums))
2.16 ms ± 71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
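Beyond speed, the two also differ in what they return: pd.unique hashes the native numeric values directly and preserves order of first appearance, whereas set() must box every element into a Python object, and its ordering is an implementation detail. A small sketch:

```python
import numpy as np
import pandas as pd

nums = np.array([3, 1, 3, 2, 1])

# pd.unique works on the raw dtype and preserves first-appearance order
print(pd.unique(nums))   # [3 1 2]

# set() boxes each element into a Python int first;
# the resulting order is arbitrary, so sort if you need determinism
print(sorted(set(nums)))  # [1, 2, 3]
```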
chrisb answered Nov 03 '22 02:11

It makes sense to use the categorical dtype for columns that have few unique values.

Demo:

df = pd.DataFrame(np.random.choice(['aa','bbbb','c','ddddd','EeeeE','xxx'], 10**6), columns=['Day'])

In [34]: %timeit list(set(df['Day']))
98.1 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [35]: %timeit df['Day'].unique()
82.9 ms ± 56.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Almost the same timing for 1M rows.

Let's test category dtype:

In [37]: df['cat'] = df['Day'].astype('category')

In [38]: %timeit list(set(df['cat']))
93.7 ms ± 766 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [39]: %timeit df['cat'].unique()
25.1 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
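The categorical win comes from the representation: a categorical series stores small integer codes plus a categories index, so unique() can scan integers instead of strings, and the declared categories are available on the dtype without any scan at all. A small sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c']).astype('category')

# unique() on a categorical scans the integer codes, not the strings,
# and returns a Categorical in order of first appearance
print(s.unique())

# the categories are stored on the dtype itself - no data scan needed
print(list(s.cat.categories))  # ['a', 'b', 'c']
```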

UPDATE: 500 unique values in a 1,000,000-row DataFrame:

In [75]: a = pd.util.testing.rands_array(10, 500)

In [76]: df = pd.DataFrame({'Day':np.random.choice(a, 10**6)})

In [77]: df.shape
Out[77]: (1000000, 1)

In [78]: df.Day.nunique()
Out[78]: 500

In [79]: %timeit list(set(df['Day']))
55 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [80]: %timeit df['Day'].unique()
133 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [81]: df['cat'] = df['Day'].astype('category')

In [82]: %timeit list(set(df['cat']))
102 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [83]: %timeit df['cat'].unique()
38.3 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion: it's always better to %timeit on your real data - you might get different results...
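Outside IPython, the same comparison can be run with the stdlib timeit module (the DataFrame below is a hypothetical stand-in; substitute your real data):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for your real data
df = pd.DataFrame({'Day': np.random.choice(['Mon', 'Tue', 'Wed'], 10**5)})

for label, stmt in [('set', "list(set(df['Day']))"),
                    ('unique', "df['Day'].unique()")]:
    # time 50 runs of each statement against this module's globals
    t = timeit.timeit(stmt, globals=globals(), number=50)
    print(f'{label:>6}: {t:.4f}s for 50 runs')
```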

MaxU - stop WAR against UA answered Nov 03 '22 00:11