I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5
I now want to select all rows of this dataframe where the entries in column a start with r followed by exactly five digits.
From here I learned how one would do this if it started just with r
without the numbers:
print(df.loc[df['a'].str.startswith('r'), :])
        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5
Something like this
print(df.loc[df['a'].str.startswith(r'[r]\d{5}'), :])
does not work, of course. How would one do this properly?
Option 1
pd.Series.str.match
df.a.str.match(r'^r\d{5}$')
0    False
1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool
Use it as a filter
df[df.a.str.match(r'^r\d{5}$')]
        a  b
1  r00001  1
2  r00010  2
4  r01234  4
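As a possible variation on Option 1, here is a minimal sketch using Series.str.fullmatch, which I believe is available from pandas 1.1 onwards. It requires the whole string to match, so the ^ and $ anchors can be dropped. The names mask and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'],
                   'b': range(6)})

# fullmatch only succeeds if the entire string matches the pattern,
# so no explicit ^ or $ anchors are needed here.
mask = df['a'].str.fullmatch(r'r\d{5}')
result = df[mask]
print(result)
```

One small difference from the anchored version: $ also matches just before a trailing newline, whereas fullmatch does not, so fullmatch is slightly stricter on strings such as 'r00001\n'.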
Option 2
Custom list comprehension using string methods
f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]
[False, True, True, False, True, False]
Use it as a filter
df[[f(s) for s in df.a.values.tolist()]]
        a  b
1  r00001  1
2  r00010  2
4  r01234  4
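A caveat worth checking for Option 2: the string-method version and the regex version are not guaranteed to agree on every input. The sketch below uses a hypothetical helper name (is_r_plus_five_digits) and made-up sample values to show two cases, a non-string entry and a Unicode character that str.isdigit accepts but \d does not.

```python
import re

def is_r_plus_five_digits(s):
    # Hypothetical helper: same logic as the lambda above,
    # plus a guard so that non-string entries (e.g. None) give False.
    return isinstance(s, str) and len(s) == 6 and s.startswith('r') and s[1:].isdigit()

# Made-up sample values; the third one ends in a superscript two.
samples = ['r00001', 'r1234', 'r0000\u00b2', None]

by_string_methods = [is_r_plus_five_digits(s) for s in samples]
by_regex = [isinstance(s, str) and re.fullmatch(r'r\d{5}', s) is not None
            for s in samples]

print(by_string_methods)
print(by_regex)
```

The two lists differ only on the superscript entry. If that matters for your data, adding s[1:].isascii() to the check (Python 3.7+) should close the gap.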
Timing
df = pd.concat([df] * 10000, ignore_index=True)
%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match(r'^r\d{5}$')]
%timeit df[df.a.str.contains(r'^r\d{5}$')]
10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop
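The %timeit lines above only work in IPython or Jupyter. For a plain Python script, a rough equivalent using the standard timeit module might look like the sketch below. The function names and the variable big are my own, and the numbers will vary by machine and pandas version, so no timings are shown.

```python
import timeit
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'],
                   'b': range(6)})
big = pd.concat([df] * 10000, ignore_index=True)

def with_list_comprehension():
    # Option 2: plain string methods in a list comprehension
    values = big['a'].tolist()
    return big[[s.startswith('r') and len(s) == 6 and s[1:].isdigit() for s in values]]

def with_str_match():
    # Option 1: vectorised regex match
    return big[big['a'].str.match(r'^r\d{5}$')]

for fn in (with_list_comprehension, with_str_match):
    # Best of 3 repeats, 5 calls each, reported per call
    seconds = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```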
You can use str.contains and pass a regex pattern:
In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]
Out[112]:
        a  b
1  r00001  1
2  r00010  2
4  r01234  4
In this pattern, ^r anchors the match at the start of the string and requires the character r, and \d{5} then requires five digits in a row.
startswith looks for a literal string, not a regex pattern, which is why your attempt fails.
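To make the literal-versus-regex point concrete, here is a small sketch with made-up values. The third string literally begins with the characters [r]\d{5}, which is the only thing startswith will accept for that argument.

```python
import pandas as pd

# Made-up values; the last one literally starts with the text "[r]\d{5}".
s = pd.Series(['abc', 'r00001', r'[r]\d{5}xyz'])

# startswith compares against the literal characters [ r ] \ d { 5 }
literal = s.str.startswith(r'[r]\d{5}')

# match interprets the same text as a regular expression
regex = s.str.match(r'[r]\d{5}')

print(literal.tolist())
print(regex.tolist())
```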
Regarding the difference between str.contains and str.match: they are analogous, but str.contains uses re.search whilst str.match uses re.match, which is stricter because it only matches at the start of the string; see the docs.
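A quick sketch of that difference, using the same pattern without any anchors and a few made-up strings:

```python
import pandas as pd

# Made-up values: a clean hit, a hit preceded by another character,
# and a string with six digits after the r.
s = pd.Series(['r00001', 'xr00001', 'r000010'])
pattern = r'r\d{5}'

found_anywhere = s.str.contains(pattern)  # re.search: match may start anywhere
found_at_start = s.str.match(pattern)     # re.match: match must start at position 0

print(found_anywhere.tolist())
print(found_at_start.tolist())
```

Note that neither call rejects 'r000010' here, because nothing in the pattern says the string has to end after the fifth digit.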
Edit
To answer your comment: add $ so that the pattern matches only a specific number of characters; see related:
In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df
Out[117]:
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5
In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]
Out[118]:
        a  b
2  r00010  2
4  r01234  4
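One more thing that may matter in practice, assuming your real column can contain missing values (the example above does not): the string methods return a missing value for those rows, and such a mask may not be usable for indexing. Passing na=False is one way to treat them as non-matches. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing entry in column a
df = pd.DataFrame({'a': ['r00010', np.nan, 'r1234', 'r01234'], 'b': range(4)})

# na=False makes missing entries count as "no match",
# so the mask is purely boolean and safe to index with.
mask = df['a'].str.match(r'r\d{5}$', na=False)
result = df.loc[mask]
print(result)
```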