How to call unique() on dask DataFrame

Tags:

pandas

dask

How do I call unique() on a dask DataFrame?

I get the following error if I try to call it the same way as for a regular pandas dataframe:

In [27]: len(np.unique(ddf[['col1','col2']].values))

AttributeError                            Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))

/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924             return self._constructor_sliced(merge(self.dask, dsk), name,
1925                                             meta, self.divisions)
-> 1926         raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute 'values'
femibyte asked Nov 28 '16 15:11



1 Answer

For both pandas and dask.dataframe, use the drop_duplicates method. It returns the distinct rows of the frame, so there is no need to go through .values and np.unique:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})

In [3]: df.drop_duplicates()
Out[3]: 
   x   y
0  1  10
2  2  20

In [4]: import dask.dataframe as dd

In [5]: ddf = dd.from_pandas(df, npartitions=2)

In [6]: ddf.drop_duplicates().compute()
Out[6]: 
   x   y
0  1  10
2  2  20
MRocklin answered Oct 06 '22 16:10
