
Select data using a regular expression

I have a dataframe like this

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

I now want to select all rows of this dataframe where the entries in column a start with r followed by five digits.

From here I learned how one would do this if it started just with r without the numbers:

print(df.loc[df['a'].str.startswith('r'), :])

        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

Something like this

print(df.loc[df['a'].str.startswith(r'[r]\d{5}'), :])

of course does not work. How would one do this properly?

Cleb asked Jul 06 '17 15:07


2 Answers

Option 1
pd.Series.str.match

df.a.str.match(r'^r\d{5}$')

0    False
1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool

Use it as a filter

df[df.a.str.match(r'^r\d{5}$')]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Option 2
Custom list comprehension using string methods

f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]

[False, True, True, False, True, False]

Use it as a filter

df[[f(s) for s in df.a.values.tolist()]]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Timing

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match(r'^r\d{5}$')]
%timeit df[df.a.str.contains(r'^r\d{5}$')]

10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop
piRSquared answered Nov 03 '22 21:11


You can use str.contains and pass a regex pattern:

In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]

Out[112]: 
        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Here ^r anchors the match at the start of the string and requires the character r, and \d{5} then requires five digits.

startswith looks for a literal string prefix, not a regex pattern, which is why it fails.
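
As a minimal sketch of that point, assuming the dataframe from the question (the names literal_mask and regex_mask are only illustrative): startswith compares its argument character by character, so it would only match an entry that literally begins with the text [r]\d{5}.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

# str.startswith treats its argument as a literal prefix, not a regex,
# so no entry begins with the literal text "[r]\d{5}".
literal_mask = df['a'].str.startswith(r'[r]\d{5}')

# str.match interprets the pattern as a regular expression instead.
regex_mask = df['a'].str.match(r'r\d{5}')

print(literal_mask.tolist())
print(regex_mask.tolist())
```

The first mask is False everywhere, which is why the attempt in the question returns an empty frame rather than raising an error.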

Regarding the difference between str.contains and str.match: they are analogous, but str.contains uses re.search, which finds a match anywhere in the string, whilst str.match uses re.match, which only matches at the beginning of the string; see the docs.
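
A small sketch of that difference, using a made-up Series rather than the data from the question, and a pattern without any anchors so that the two methods can disagree:

```python
import pandas as pd

# Illustrative values only: a plain match, one with a leading
# character, and one with an extra trailing digit.
s = pd.Series(['r00001', 'xr00001', 'r000010'])

pattern = r'r\d{5}'

# str.contains searches anywhere in the string (like re.search).
contains_result = s.str.contains(pattern).tolist()

# str.match only matches at the start of the string (like re.match),
# but it does not require the match to reach the end of the string.
match_result = s.str.match(pattern).tolist()

print(contains_result)  # [True, True, True]
print(match_result)     # [True, False, True]
```

Neither method rejects 'r000010' here, which is why a trailing $ is needed when the length has to be exact.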

edit

To answer your comment: add $ so that the pattern matches an exact number of characters, see related:

In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df

Out[117]: 
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5


In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]

Out[118]: 
        a  b
2  r00010  2
4  r01234  4
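
As a possible alternative, not part of the original answer and assuming pandas 1.1 or newer, str.fullmatch requires the entire string to match, so the anchors can be left out. A sketch on the same dataframe (the name selected is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

# str.fullmatch succeeds only if the whole string matches the pattern,
# so neither ^ nor $ is needed here.
selected = df.loc[df['a'].str.fullmatch(r'r\d{5}')]

print(selected)
```

This should select the same two rows, 2 and 4, as the str.match call with the trailing $.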
EdChum answered Nov 03 '22 21:11