Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter rows of python datatable based on whether it is in a list

I am new to working with python datatables and here is the tutorial I am following

How do I filter out the rows where the values in a certain column are contained in a list?

Essentially this is the code I am working with:

    import datatable as dt
    sfr = dt.fread(os.path.join(dirName, 'Results.csv'))

sfr
Out[25]: 
         |       ioid  itemtype  date       itemid           tid  value
-------- + ----------  --------  --------  -------  ------------  -------
       0 |          1         1  7-1-2022     9015           531   0.0283
       1 |          1         1  7-1-2022     9015           532   0.0071
       2 |          1         1  7-1-2022     9016           534   0.0065
       3 |          1         1  7-1-2022     9017          1018   0.0005

I am trying to do the following

ids = [9016, 9017]
sft[dt.f.itemid.isin(ids)]

However, I am unable to make the syntax work.

Desired output is:

         |       ioid  itemtype  date       itemid           tid  value
-------- + ----------  --------  --------  -------  ------------  -------
       2 |          1         1  7-1-2022     9016           534   0.0065
       3 |          1         1  7-1-2022     9017          1018   0.0005
like image 370
Zanam Avatar asked May 30 '21 00:05

Zanam


People also ask

How do I select a row based on a list?

Select Pandas Rows Based on List Index. You can use loc[] or iloc[] with your desired indices to select rows from DataFrame. For more details refer to Using Index with loc[] to select rows from DataFrame.


1 Answers

Because Pydatatable doesn't explicitly support filtering against the list of values (see feature request), the following solution may appear not quite intuitive. Still, it does exactly what such filtering would do by utilizing datatable join function:

ids = [9016, 9017]
f = dt.Frame(itemid = ids)
sfr.key = "itemid"
sfr_filtered = f[:, :, dt.join(sfr)]

These steps consist of:

  1. creating a frame (datable) that contains single column with the same name as the column being filtered and storing list of values inside this frame
  2. keying original frame on the column being filtered
  3. joining frames to perform filtering using datatable join

Here is a catch though: currently join has limitations - it's only left outer join and the join column (itemid in sfr in the solution above) must have unique values, otherwise sfr.key = "itemid" throws this error:

ValueError: Cannot set a key: the values are not unique

Because of these strong assumptions it won't work when itemid contains not unique values. In that case we turn join around and add additional filter on a dummy column added to the frame containing list:

ids = [9016, 9017]
f = dt.Frame(itemid = ids, dummy = [0, 0])
f.key = "itemid"
sfr_filtered = sfr[:, :, dt.join(f)][~dt.isna(dt.f.dummy), :]

This solution will always work but obviously less efficient due to always joining all rows and extra filtering to remove unmatched rows. For more details on current state of Pydatatable join you can find docs here.

UPDATE

For completeness I added another (arguably more straight forward) solution borrowed from this answer by Pasha (the creator and maintainer of Pydatatable):

import functools
import operator

filter = functools.reduce(operator.or_, (dt.f.itemid == id for id in ids))
sfr[filter, :]

Until this feature request https://github.com/h2oai/datatable/issues/699 is implemented one of the solutions shown above does filtering against list of values.

like image 67
topchef Avatar answered Oct 22 '22 04:10

topchef