I want to select rows from a dask dataframe based on a list of indices. How can I do that?
Example: Let's say I have the following dask dataframe.
import pandas as pd
import dask.dataframe as dd
dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dd.from_pandas(pdf, npartitions=2)
Furthermore, I have a list of indices that I am interested in, e.g.
indices_i_want_to_select = ['x1','x3', 'y6']
From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select.
set_index syntax
Create a pandas DataFrame with two columns of data, and a 2-partition Dask DataFrame from it. Print the DataFrame and see that it has one index column, created by default by pandas, and two columns with data. Take a look at the divisions of ddf. ddf has two partitions, so its divisions tuple holds three boundary values.
Just like pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns.
Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
Edit: dask now supports loc on lists:
ddf_selected = ddf.loc[indices_i_want_to_select]
The following should still work, but is not necessary anymore:
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)
#list of indices I want to select
l = ['i1', 4, 5]
#generate new dask dataframe containing only the specified indices
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta=ddf._meta)  # meta: same columns and dtypes as ddf
Using dask version '1.2.0', this results in an error due to the mixed index type. In any case, there is the option to use loc.
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions=2)
#list of indices I want to select
l = ['i1', '4', '5']
#generate new dask dataframe containing only the specified indices
#ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()