I have a pandas DataFrame with multiple columns, and I want to find the unique values in one of those columns.
Two ways to achieve this are:
Build a list from a set of the series: list(set(data['Day']))
Use the pandas method: data['Day'].unique()
In my trials, the set method is faster. Is this true in most cases, and why or why not? Are there any other resource-utilization implications?
Please also explain why either of them performs better.
It depends on the data type. For numeric types, pd.unique should be significantly faster.
For strings, which are stored as Python objects, the difference will be much smaller, and set() will usually be competitive, since it is doing very similar work.
Some examples:
strs = np.repeat(np.array(['a', 'b', 'c'], dtype='O'), 10000)
In [11]: %timeit pd.unique(strs)
558 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [12]: %timeit list(set(strs))
531 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
nums = np.repeat(np.array([1, 2, 3]), 10000)
In [13]: %timeit pd.unique(nums)
230 µs ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [14]: %timeit list(set(nums))
2.16 ms ± 71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
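A small sketch of one likely reason for the numeric gap: set() iterates the array from Python, so every element is first boxed into a NumPy scalar object and then hashed, while pd.unique works on the raw values in compiled code. The snippet also shows that the two approaches return different container types. The variable names boxed, uniques_pd and uniques_set are only illustrative.

```python
import numpy as np
import pandas as pd

nums = np.repeat(np.array([1, 2, 3]), 10000)

# Iterating a NumPy array from Python yields one NumPy scalar object
# per element; set() has to create and hash each of them.
boxed = next(iter(nums))
print(type(boxed))

# pd.unique returns a NumPy array, in order of first appearance.
uniques_pd = pd.unique(nums)

# list(set(...)) returns a Python list; the order is not guaranteed.
uniques_set = list(set(nums))

print(type(uniques_pd), type(uniques_set))

# Both approaches find the same values.
assert sorted(uniques_pd.tolist()) == sorted(int(x) for x in uniques_set)
```

If the order of the unique values matters downstream, that difference may be a reason to prefer pd.unique regardless of speed.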
It makes sense to use the categorical dtype for columns that have few unique values.
Demo:
df = pd.DataFrame(np.random.choice(['aa','bbbb','c','ddddd','EeeeE','xxx'], 10**6), columns=['Day'])
In [34]: %timeit list(set(df['Day']))
98.1 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [35]: %timeit df['Day'].unique()
82.9 ms ± 56.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The timings are almost the same for 1M rows.
Let's test the category dtype:
In [37]: df['cat'] = df['Day'].astype('category')
In [38]: %timeit list(set(df['cat']))
93.7 ms ± 766 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [39]: %timeit df['cat'].unique()
25.1 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
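A rough illustration of why the category dtype can help, assuming a smaller 100,000-row stand-in for the demo data: a categorical column stores each distinct string once and keeps a small integer code per row, so there is far less per-row data to inspect. The seed and the variable names codes, categories and mem are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
days = rng.choice(['aa', 'bbbb', 'c', 'ddddd', 'EeeeE', 'xxx'], 10**5)
df = pd.DataFrame({'Day': days})
df['cat'] = df['Day'].astype('category')

# Each row holds a small integer code; the strings live once in .categories.
codes = df['cat'].cat.codes
categories = df['cat'].cat.categories
print(codes.dtype)
print(list(categories))

# The same values come back from both columns.
assert set(df['cat'].unique()) == set(df['Day'].unique())

# Memory footprint of each column, in bytes, including the string payloads.
mem = df.memory_usage(deep=True)
print(mem)
```

The memory comparison also speaks to the resource-utilization part of the question: for a column with few distinct values, the categorical version is typically much smaller.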
UPDATE: 500 unique values in a DataFrame with 1,000,000 rows:
In [75]: a = pd.util.testing.rands_array(10, 500)
In [76]: df = pd.DataFrame({'Day':np.random.choice(a, 10**6)})
In [77]: df.shape
Out[77]: (1000000, 1)
In [78]: df.Day.nunique()
Out[78]: 500
In [79]: %timeit list(set(df['Day']))
55 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit df['Day'].unique()
133 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [81]: df['cat'] = df['Day'].astype('category')
In [82]: %timeit list(set(df['cat']))
102 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [83]: %timeit df['cat'].unique()
38.3 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
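The pd.util.testing module used in In [75] was deprecated and may not be available in recent pandas releases. A possible stand-in for building similar test data with only the standard library and NumPy is sketched below; the seed, the alphabet and the name pool are assumptions, not part of the original session.

```python
import random
import string

import numpy as np
import pandas as pd

random.seed(0)  # arbitrary seed, for repeatability only
alphabet = string.ascii_letters + string.digits

# Collect 500 distinct random strings of length 10.
pool = set()
while len(pool) < 500:
    pool.add(''.join(random.choices(alphabet, k=10)))
a = np.array(sorted(pool), dtype=object)

# Draw 1,000,000 rows from those 500 strings.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Day': rng.choice(a, 10**6)})

print(df.shape)
print(df['Day'].nunique())
```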
Conclusion: it's always better to run %timeit on your real data, since you may get different results.
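Outside IPython, the same comparison can be run with the standard timeit module. The helper below is a minimal sketch; compare_unique_timings is a hypothetical name, and the sample Series is only a placeholder for your real column.

```python
import timeit

import numpy as np
import pandas as pd


def compare_unique_timings(series, number=20):
    """Return the average seconds per call for both approaches."""
    t_set = timeit.timeit(lambda: list(set(series)), number=number) / number
    t_unique = timeit.timeit(lambda: series.unique(), number=number) / number
    return {'set': t_set, 'unique': t_unique}


# Placeholder data; replace with a column from your own DataFrame.
rng = np.random.default_rng(0)
sample = pd.Series(rng.integers(0, 500, size=100_000))

timings = compare_unique_timings(sample)
print(timings)
```

Running it on columns of different dtypes (numeric, object, category) should show whether the pattern from the answers above holds for your data.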