I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5
I now want to select all rows of this dataframe where the entries in column a start with r followed by five digits.
From here I learned how one would do this if it started just with r without the numbers:
print(df.loc[df['a'].str.startswith('r'), :])
        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5
Something like this
print(df.loc[df['a'].str.startswith(r'[r]\d{5}'), :])
does of course not work. How would one do this properly?
Option 1
pd.Series.str.match
df.a.str.match(r'^r\d{5}$')
0    False
1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool
Use it as a filter
df[df.a.str.match(r'^r\d{5}$')]
        a  b
1  r00001  1
2  r00010  2
4  r01234  4
Option 2
Custom list comprehension using string methods
f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]
[False, True, True, False, True, False]
Use it as a filter
df[[f(s) for s in df.a.values.tolist()]]
        a  b
1  r00001  1
2  r00010  2
4  r01234  4
Timing
df = pd.concat([df] * 10000, ignore_index=True)
%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match(r'^r\d{5}$')]
%timeit df[df.a.str.contains(r'^r\d{5}$')]
10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop
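Before comparing speed, it can be worth confirming that the three timed expressions select the same rows. The following is a minimal sketch of such a check, not part of the original timing run; the variable names and the smaller repetition factor of 1000 are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'],
                   'b': range(6)})
# Repeat the six rows to get a larger frame (1000 copies is an arbitrary choice).
df = pd.concat([df] * 1000, ignore_index=True)

# Plain string methods evaluated in a list comprehension.
by_comprehension = df[[s.startswith('r') and len(s) == 6 and s[1:].isdigit()
                       for s in df['a'].tolist()]]
# Regex anchored at both ends, via str.match and str.contains.
by_match = df[df['a'].str.match(r'^r\d{5}$')]
by_contains = df[df['a'].str.contains(r'^r\d{5}$')]

# All three filters should return identical frames.
same = by_comprehension.equals(by_match) and by_match.equals(by_contains)
print(same)
print(len(by_match))
```

Absolute timings depend on the machine and the pandas version, so the figures above are best read as a relative comparison.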
You can use str.contains and pass a regex pattern:
In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]
Out[112]:
        a  b
1  r00001  1
2  r00010  2
4  r01234  4
Here the pattern reads as follows: ^r anchors the match to an r at the start of the string, and \d{5} then requires five digits.
startswith compares against a literal prefix rather than a regex pattern, which is why it fails.
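To illustrate the point about startswith, here is a small sketch using the values from the question. The Series name s and the three result names are chosen for illustration only.

```python
import pandas as pd

s = pd.Series(['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'])

# startswith compares against the literal text, so the regex syntax is
# looked for character by character and no entry matches.
literal = s.str.startswith(r'r\d{5}')

# A literal prefix works as expected.
prefix = s.str.startswith('r')

# str.match interprets the pattern as a regular expression.
regex = s.str.match(r'r\d{5}$')

print(literal.tolist())
print(prefix.tolist())
print(regex.tolist())
```

Only the last call treats \d{5} as "five digits"; the first one searches for a backslash, a d and braces.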
Regarding the difference between str.contains and str.match: they are analogous, but str.contains uses re.search whilst str.match uses re.match, which is stricter because it only matches at the start of the string; see the docs.
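The difference is easiest to see with a value where the pattern occurs but not at the start. This is a minimal sketch; the value 'xr00001' is made up for the example and does not appear in the question's data.

```python
import pandas as pd

s = pd.Series(['r00001', 'xr00001'])

# contains behaves like re.search: the pattern may be found anywhere.
found_anywhere = s.str.contains(r'r\d{5}')

# match behaves like re.match: the pattern must begin at position 0.
found_at_start = s.str.match(r'r\d{5}')

print(found_anywhere.tolist())
print(found_at_start.tolist())
```

This is why the str.contains call above needs the leading ^, while str.match gives the same result without it.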
edit
To answer your comment: add $ to the end of the pattern so that it matches exactly five digits and nothing after them, see related:
In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df
Out[117]:
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5
In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]
Out[118]:
        a  b
2  r00010  2
4  r01234  4
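The effect of the trailing $ can be checked side by side on the same data. The sketch below also shows str.fullmatch as an alternative, on the assumption that pandas 1.1 or later is installed; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'],
                   'b': range(6)})

# Without the trailing anchor, 'r000010' is selected because its first
# six characters already satisfy the pattern.
unanchored = df.loc[df['a'].str.match(r'r\d{5}')]

# With the trailing anchor, only r plus exactly five digits is selected.
anchored = df.loc[df['a'].str.match(r'r\d{5}$')]

# Assumption: pandas >= 1.1, where str.fullmatch requires the whole
# string to match, so no explicit anchors are needed.
full = df.loc[df['a'].str.fullmatch(r'r\d{5}')]

print(unanchored['a'].tolist())
print(anchored['a'].tolist())
print(full['a'].tolist())
```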