How can I select data from a dask dataframe by a list of indices?

I want to select rows from a dask dataframe based on a list of indices. How can I do that?

Example: let's say I have the following dask dataframe:

import pandas as pd
import dask.dataframe as dd

dict_ = {'A': [1, 2, 3, 4, 5, 6, 7], 'B': [2, 3, 4, 5, 6, 7, 8], 'index': ['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dd.from_pandas(pdf, npartitions=2)

Furthermore, I have a list of indices that I am interested in, e.g.

indices_i_want_to_select = ['x1','x3', 'y6']

From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select.

Arco Bast asked Jul 12 '16 00:07



2 Answers

Edit: dask now supports loc on lists:

ddf_selected = ddf.loc[indices_i_want_to_select]

The following should still work, but is not necessary anymore:

import pandas as pd
import dask.dataframe as dd

# generate example dataframe (note the mixed str/int index)
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5], B=[6, 7, 8, 9, 0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions=2)

# list of indices I want to select
l = ['i1', 4, 5]

# generate a new dask dataframe containing only the specified indices;
# the filter keeps all columns and dtypes, so the input's metadata is reused
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta=ddf._meta)

Arco Bast answered Nov 15 '22 06:11


With dask version 1.2.0, the example above raises an error because of the mixed index type (strings and integers). In any case, you can use loc instead, shown here with an all-string index:

import pandas as pd
import dask.dataframe as dd

# generate example dataframe with an all-string index
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5], B=[6, 7, 8, 9, 0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions=2)

# list of indices I want to select
l = ['i1', '4', '5']

# generate a new dask dataframe containing only the specified indices
# (the map_partitions variant is kept for reference only)
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta=ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()

skibee answered Nov 15 '22 06:11