In dask, what is the difference between df.col.unique() and df.col.drop_duplicates()?

Both return a series containing the unique elements of df.col. There is a difference in the index: the result of unique is indexed 1..N, while drop_duplicates is indexed by an arbitrary-looking sequence of numbers.

What is the significance of the index returned by drop_duplicates?
Is there any reason to use one over the other if the index is not important?
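To make the observation concrete, here is a minimal sketch (the toy frame is hypothetical, and exactly which original labels drop_duplicates keeps can depend on partitioning):

import pandas as pd
import dask.dataframe as dd

# Toy column with duplicates; the default index is 0..4
ddf = dd.from_pandas(
    pd.DataFrame({'col': ['a', 'b', 'a', 'c', 'b']}),
    npartitions=2)

ddf.col.unique().compute()           # fresh positional index: 0, 1, 2
ddf.col.drop_duplicates().compute()  # original labels of the kept rows, e.g. 0, 1, 3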
Dask.dataframe has both because Pandas has both, and dask.dataframe mostly copies the Pandas API. unique is a holdover from Pandas' history with NumPy; in Pandas it returns a NumPy array rather than a Series:
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'A'], name='I'))

In [3]: df.x.drop_duplicates()
Out[3]:
I
a    1
b    2
Name: x, dtype: int64

In [4]: df.x.unique()
Out[4]: array([1, 2])
In dask.dataframe we deviate slightly and choose to use a dask.dataframe.Series rather than a dask.array.Array, because one can't precompute the length of the array and so can't act lazily.
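The same comparison on the dask side, as a minimal sketch (mirroring the pandas example above; from_pandas and compute are standard dask API, and which 'I' labels survive may depend on partitioning):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]},
                   index=pd.Index(['a', 'b', 'A'], name='I'))
ddf = dd.from_pandas(pdf, npartitions=2)

# Both results are lazy dask Series, not concrete arrays:
u = ddf.x.unique()           # computed result gets a fresh index
d = ddf.x.drop_duplicates()  # computed result keeps labels from 'I'

u.compute()  # values [1, 2]
d.compute()  # values [1, 2], indexed by the surviving 'I' labels

Nothing runs until compute() is called, which is exactly why a NumPy array, whose length would have to be known up front, cannot be returned lazily.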
In practice there is little reason to use unique over drop_duplicates.