I am new to working with python datatables and here is the tutorial I am following
How do I filter out the rows where the values in a certain column are contained in a list?
Essentially this is the code I am working with:
import datatable as dt
sfr = dt.fread(os.path.join(dirName, 'Results.csv'))
sfr
Out[25]:
| ioid itemtype date itemid tid value
-------- + ---------- -------- -------- ------- ------------ -------
0 | 1 1 7-1-2022 9015 531 0.0283
1 | 1 1 7-1-2022 9015 532 0.0071
2 | 1 1 7-1-2022 9016 534 0.0065
3 | 1 1 7-1-2022 9017 1018 0.0005
I am trying to do the following
ids = [9016, 9017]
sft[dt.f.itemid.isin(ids)]
However, I am unable to make the syntax work.
Desired output is:
| ioid itemtype date itemid tid value
-------- + ---------- -------- -------- ------- ------------ -------
2 | 1 1 7-1-2022 9016 534 0.0065
3 | 1 1 7-1-2022 9017 1018 0.0005
Select Pandas Rows Based on List Index. You can use loc[] or iloc[] with your desired indices to select rows from DataFrame. For more details refer to Using Index with loc[] to select rows from DataFrame.
Because Pydatatable doesn't explicitly support filtering against the list of values (see feature request), the following solution may appear not quite intuitive. Still, it does exactly what such filtering would do by utilizing datatable join function:
ids = [9016, 9017]
f = dt.Frame(itemid = ids)
sfr.key = "itemid"
sfr_filtered = f[:, :, dt.join(sfr)]
These steps consist of:
Here is a catch though: currently join has limitations - it's only left outer join and the join column (itemid
in sfr
in the solution above) must have unique values, otherwise sfr.key = "itemid"
throws this error:
ValueError: Cannot set a key: the values are not unique
Because of these strong assumptions it won't work when itemid
contains not unique values. In that case we turn join around and add additional filter on a dummy column added to the frame containing list:
ids = [9016, 9017]
f = dt.Frame(itemid = ids, dummy = [0, 0])
f.key = "itemid"
sfr_filtered = sfr[:, :, dt.join(f)][~dt.isna(dt.f.dummy), :]
This solution will always work but obviously less efficient due to always joining all rows and extra filtering to remove unmatched rows. For more details on current state of Pydatatable join you can find docs here.
UPDATE
For completeness I added another (arguably more straight forward) solution borrowed from this answer by Pasha (the creator and maintainer of Pydatatable):
import functools
import operator
filter = functools.reduce(operator.or_, (dt.f.itemid == id for id in ids))
sfr[filter, :]
Until this feature request https://github.com/h2oai/datatable/issues/699 is implemented one of the solutions shown above does filtering against list of values.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With