In dask, what is the difference between df.col.unique() and df.col.drop_duplicates()?

Both return a series containing the unique elements of df.col. There is a difference in the index: the result of unique is indexed 1..N, while drop_duplicates is indexed by an arbitrary-looking sequence of numbers.

What is the significance of the index returned by drop_duplicates?
Is there any reason to use one over the other if the index is not important?
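To make the observation concrete, here is a minimal sketch (the toy frame is hypothetical, and exactly which original labels drop_duplicates keeps can depend on partitioning):

import pandas as pd
import dask.dataframe as dd

# Toy column with duplicates; the default index is 0..4
ddf = dd.from_pandas(
    pd.DataFrame({'col': ['a', 'b', 'a', 'c', 'b']}),
    npartitions=2)

ddf.col.unique().compute()           # fresh positional index: 0, 1, 2
ddf.col.drop_duplicates().compute()  # original labels of the kept rows, e.g. 0, 1, 3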
Dask.dataframe has both because Pandas has both, and dask.dataframe mostly copies the Pandas API. unique is a holdover from Pandas' history with NumPy; in Pandas it returns a NumPy array rather than a Series:
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'A'], name='I'))

In [3]: df.x.drop_duplicates()
Out[3]:
I
a    1
b    2
Name: x, dtype: int64

In [4]: df.x.unique()
Out[4]: array([1, 2])
In dask.dataframe we deviate slightly and choose to use a dask.dataframe.Series rather than a dask.array.Array, because one can't precompute the length of the array and so can't act lazily.
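The same comparison on the dask side, as a minimal sketch (mirroring the pandas example above; from_pandas and compute are standard dask API, and which 'I' labels survive may depend on partitioning):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]},
                   index=pd.Index(['a', 'b', 'A'], name='I'))
ddf = dd.from_pandas(pdf, npartitions=2)

# Both results are lazy dask Series, not concrete arrays:
u = ddf.x.unique()           # computed result gets a fresh index
d = ddf.x.drop_duplicates()  # computed result keeps labels from 'I'

u.compute()  # values [1, 2]
d.compute()  # values [1, 2], indexed by the surviving 'I' labels

Nothing runs until compute() is called, which is exactly why a NumPy array, whose length would have to be known up front, cannot be returned lazily.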
In practice there is little reason to use unique over drop_duplicates.