Select data using a regular expression

Question

I have a dataframe like this

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

I now want to select all rows of this dataframe where the entries in column a start with r followed by five digits.

From here I learned how one would do this if it started just with r without the numbers:

print(df.loc[df['a'].str.startswith('r'), :])

        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

Something like this

print(df.loc[df['a'].str.startswith(r'[r]\d{5}'), :])

does not work, of course. How would one do this properly?

piRSquared · Accepted Answer

Option 1
pd.Series.str.match

df.a.str.match(r'^r\d{5}$')

0    False
1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool

Use it as a filter

df[df.a.str.match(r'^r\d{5}$')]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Option 2
Custom list comprehension using string methods

f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]

[False, True, True, False, True, False]

Use it as a filter

df[[f(s) for s in df.a.values.tolist()]]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Timing

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match(r'^r\d{5}$')]
%timeit df[df.a.str.contains(r'^r\d{5}$')]

10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop

EdChum · Answer

You can use str.contains and pass a regex pattern:

In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]

Out[112]: 
        a  b
1  r00001  1
2  r00010  2
4  r01234  4

In the pattern, ^r anchors the match to a string that starts with the character r, and \d{5} then requires five digits.

startswith looks for a literal substring, not a regex pattern, which is why the attempt in the question fails.

Regarding the difference between str.contains and str.match: they are analogous, but str.contains uses re.search whilst str.match uses re.match, which is stricter because it only matches at the beginning of the string; see the docs.
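The difference between searching and matching can be seen with the standard re module alone. This is a minimal sketch; the sample strings and the results dictionary are made up for illustration, and re.fullmatch is included only for comparison since it is not mentioned above.

```python
import re

pattern = r'r\d{5}'                       # no ^ or $ anchors on purpose
samples = ['r00010', 'r000010', 'xr00010']  # illustrative strings only

results = {}
for s in samples:
    found = bool(re.search(pattern, s))     # match anywhere in the string
    at_start = bool(re.match(pattern, s))   # match only at the start of the string
    whole = bool(re.fullmatch(pattern, s))  # the entire string must match
    results[s] = (found, at_start, whole)
    print(s, found, at_start, whole)
```

Without anchors, re.search accepts all three strings, re.match rejects the one that does not start with r, and only re.fullmatch rejects the string with six digits.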

edit

To answer your comment: add $ so that the pattern must end right after the five digits, which rejects longer strings such as r000010:

In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df

Out[117]: 
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5


In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]

Out[118]: 
        a  b
2  r00010  2
4  r01234  4
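As a possible alternative to writing the anchors by hand, the sketch below uses Series.str.fullmatch, assuming pandas 1.1 or later where that method is available. The extra None row is an assumption added to show the na=False argument, which keeps missing values from ending up in the boolean mask; the names mask and selected are illustrative.

```python
import pandas as pd

# Same data as in the question, plus one missing value (an added assumption)
df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234', None],
                   'b': range(7)})

# fullmatch requires the whole string to match, so neither ^ nor $ is needed.
# na=False turns missing values into False instead of leaving NaN in the mask.
mask = df['a'].str.fullmatch(r'r\d{5}', na=False)
selected = df[mask]
print(selected)
```

This should select the same three rows (index 1, 2 and 4) as the anchored patterns above.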
