How can I select data from a dask dataframe by a list of indices?

Tags: python, indexing, dask

I want to select rows from a dask dataframe based on a list of indices. How can I do that?

Example: say I have the following dask dataframe:

import pandas as pd
import dask.dataframe as dd

dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dd.from_pandas(pdf, npartitions = 2)

I also have a list of index labels that I am interested in, e.g.:

indices_i_want_to_select = ['x1','x3', 'y6']

From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select.


asked Jul 12 '16 00:07

Arco Bast

2 Answers

Edit: dask now supports loc on lists:

ddf_selected = ddf.loc[indices_i_want_to_select]

The following should still work, but is not necessary anymore:

import pandas as pd
import dask.dataframe as dd

# generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)

# list of indices I want to select
l = ['i1', 4, 5]

# generate new dask dataframe containing only the specified indices;
# meta describes the output, which has the same columns and dtypes as ddf
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf._meta)
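To see what the lambda does to each partition, here is a pandas-only sketch. The two slices stand in for partitions; that split is an assumption made for illustration, since dask chooses the actual partition boundaries itself.

```python
import pandas as pd

pdf = pd.DataFrame(
    dict(A=[1, 2, 3, 4, 5], B=[6, 7, 8, 9, 0]),
    index=['i1', 'i2', 'i3', 4, 5],
)
l = ['i1', 4, 5]

# hypothetical split into two "partitions" (dask picks the real boundaries)
partitions = [pdf.iloc[:3], pdf.iloc[3:]]

# same filter as in the lambda: keep rows whose index label is in l
filtered = [part[part.index.isin(l)] for part in partitions]

# gluing the filtered pieces together gives the selected rows
selected = pd.concat(filtered)
print(selected)
```

Because every partition is filtered independently, this approach does not need the index to be sorted, but it does scan all partitions.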


answered Nov 15 '22 06:11

Arco Bast

With dask version 1.2.0, the example above raises an error because of the mixed index type (strings and integers). In any case, you can use loc:

import pandas as pd
import dask.dataframe as dd

# generate example dataframe (all index labels are strings)
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2)

# list of indices I want to select
l = ['i1', '4', '5']

# generate new dask dataframe containing only the specified indices
# earlier approach, replaced by loc:
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()

answered Nov 15 '22 06:11

skibee
