dask df.col.unique() vs df.col.drop_duplicates()

Tags:

dask

In dask what is the difference between

df.col.unique()

and

df.col.drop_duplicates()

Both return a series containing the unique elements of df.col. The difference is in the index: the unique result is indexed by 1..N, while the drop_duplicates result is indexed by an arbitrary-looking sequence of numbers.

What is the significance of the index returned by drop_duplicates?

Is there any reason to use one over the other if the index is not important?

Daniel Mahler asked Mar 07 '16 06:03

1 Answer

Dask.dataframe has both because Pandas has both, and dask.dataframe mostly copies the Pandas API. unique is a holdover from Pandas' history with NumPy.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'A'], name='I'))

In [3]: df.x.drop_duplicates()
Out[3]: 
I
a    1
b    2
Name: x, dtype: int64

In [4]: df.x.unique()
Out[4]: array([1, 2])

In dask.dataframe we deviate slightly and have unique return a dask.dataframe.Series rather than a dask.array.Array, because the length of the result can't be computed ahead of time, so a lazy dask array isn't possible here.

In practice there is little reason to use unique over drop_duplicates.

MRocklin answered Nov 22 '22 22:11