series.unique vs list of set - performance

I have a pandas DataFrame with multiple columns. The aim is to find the unique values in one of those columns.

Two ways to achieve this are:

  • Build a set of that series and convert it to a list: list(set(data['Day']))

  • Use pandas' built-in method: data['Day'].unique()

In my trials, the set approach works faster. Is this true in most cases? Why or why not? Are there any other resource-utilization implications?

Please also explain why one works better than the other.
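A minimal way to reproduce the comparison (the 'Day' column below is hypothetical stand-in data; swap in your own DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 'Day' column with many repeats of a few values
data = pd.DataFrame({'Day': np.random.choice(['Mon', 'Tue', 'Wed'], 100_000)})

# Approach 1: Python set, converted to a list
uniq_set = list(set(data['Day']))

# Approach 2: pandas' built-in unique()
uniq_pd = data['Day'].unique()

# Both yield the same values, though their ordering may differ
assert sorted(uniq_set) == sorted(uniq_pd)
```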

Vikram Tiwari asked Oct 19 '17 21:10
2 Answers

It will depend on the data type. For numeric types, pd.unique should be significantly faster.

For strings, which are stored as Python objects, the difference is much smaller, and set() will usually be competitive, since it is doing a very similar thing.

Some examples:

strs = np.repeat(np.array(['a', 'b', 'c'], dtype='O'), 10000)

In [11]: %timeit pd.unique(strs)
558 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [12]: %timeit list(set(strs))
531 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

nums = np.repeat(np.array([1, 2, 3]), 10000)

In [13]: %timeit pd.unique(nums)
230 µs ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [14]: %timeit list(set(nums))
2.16 ms ± 71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
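Beyond speed, the two also differ in what they return: pd.unique hashes the native numeric values directly and preserves order of first appearance, whereas set() must box every element into a Python object, and its ordering is an implementation detail. A small sketch:

```python
import numpy as np
import pandas as pd

nums = np.array([3, 1, 3, 2, 1])

# pd.unique works on the raw dtype and preserves first-appearance order
print(pd.unique(nums))   # [3 1 2]

# set() boxes each element into a Python int first;
# the resulting order is arbitrary, so sort if you need determinism
print(sorted(set(nums)))  # [1, 2, 3]
```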
chrisb answered Nov 03 '22 02:11

It makes sense to use the categorical dtype for columns that have few unique values.

Demo:

df = pd.DataFrame(np.random.choice(['aa','bbbb','c','ddddd','EeeeE','xxx'], 10**6), columns=['Day'])

In [34]: %timeit list(set(df['Day']))
98.1 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [35]: %timeit df['Day'].unique()
82.9 ms ± 56.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Almost the same timing for 1M rows.

Let's test category dtype:

In [37]: df['cat'] = df['Day'].astype('category')

In [38]: %timeit list(set(df['cat']))
93.7 ms ± 766 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [39]: %timeit df['cat'].unique()
25.1 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
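The categorical win comes from the representation: a categorical series stores small integer codes plus a categories index, so unique() can scan integers instead of strings, and the declared categories are available on the dtype without any scan at all. A small sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c']).astype('category')

# unique() on a categorical scans the integer codes, not the strings,
# and returns a Categorical in order of first appearance
print(s.unique())

# the categories are stored on the dtype itself - no data scan needed
print(list(s.cat.categories))  # ['a', 'b', 'c']
```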

UPDATE: 500 unique values in a 1,000,000-row DataFrame:

In [75]: a = pd.util.testing.rands_array(10, 500)

In [76]: df = pd.DataFrame({'Day':np.random.choice(a, 10**6)})

In [77]: df.shape
Out[77]: (1000000, 1)

In [78]: df.Day.nunique()
Out[78]: 500

In [79]: %timeit list(set(df['Day']))
55 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [80]: %timeit df['Day'].unique()
133 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [81]: df['cat'] = df['Day'].astype('category')

In [82]: %timeit list(set(df['cat']))
102 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [83]: %timeit df['cat'].unique()
38.3 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion: it's always better to %timeit on your real data - you might get different results...
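Outside IPython, the same comparison can be run with the stdlib timeit module (the DataFrame below is a hypothetical stand-in; substitute your real data):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for your real data
df = pd.DataFrame({'Day': np.random.choice(['Mon', 'Tue', 'Wed'], 10**5)})

for label, stmt in [('set', "list(set(df['Day']))"),
                    ('unique', "df['Day'].unique()")]:
    # time 50 runs of each statement against this module's globals
    t = timeit.timeit(stmt, globals=globals(), number=50)
    print(f'{label:>6}: {t:.4f}s for 50 runs')
```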

MaxU - stop WAR against UA answered Nov 03 '22 00:11