>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']] # Succeeds, as in the answer below.
I'd like something that doesn't fail in either of
>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
Is there a function like loc
which gracefully handles this, or some other way of expressing this query?
Update for @AlexLenail comment
It's a fair point that this will be slow for large lists. I did a little bit of more digging and found that the intersection
method is available for Indexes
and columns. I'm not sure about the algorithmic complexity but it's much faster empirically.
You can do something like this.
good_keys = df.index.intersection(all_keys)
df.loc[good_keys]
Or like your example
df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
Here is a little experiment below
n = 100000
# Create random values and random string indexes
# have the bad indexes contain extra values not in DataFrame Index
rand_val = np.random.rand(n)
rand_idx = []
for x in range(n):
rand_idx.append(str(x))
bad_idx = []
for x in range(n*2):
bad_idx.append(str(x))
df = pd.DataFrame(rand_val, index=rand_idx)
df.head()
def get_valid_keys_list_comp():
# Return filtered DataFrame using list comprehension to filter keys
vkeys = [key for key in bad_idx if key in df.index.values]
return df.loc[vkeys]
def get_valid_keys_intersection():
# Return filtered DataFrame using list intersection() to filter keys
vkeys = df.index.intersection(bad_idx)
return df.loc[vkeys]
%%timeit
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original answer
I'm not sure if pandas has a built-in function to handle this but you can use Python list comprehension to filter to valid indexes with something like this.
Given a DataFrame df2
A B C D F
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
You can filter your index query with this
keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]
This will also work for columns if you use df2.columns
instead of df2.index.values
I found an alternative (provided a check for df.empty is made beforehand). You could do something like this
df[df.index=='2'] -> returns either a dataframe with matched values or empty dataframe.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With