
Why my regex with r'string' matches but not 'string' using Python?

Tags: python, regex

The way regex works in Python is so intensely puzzling that it makes me more furious with each passing second. Here's my problem:

I understand that this gives a result:

re.search(r'\bmi\b', 'grand rapids, mi 49505')

while this doesn't:

re.search('\bmi\b', 'grand rapids, mi 49505')

And that's okay. I get that much of it. Now, I have a regular expression that's being generated like this:

regex = '|'.join(['\b' + str(state) + '\b' for state in states])

If I now do re.search(regex, 'grand rapids, mi 49505'), it fails for the same reason my second search() example fails.

My question: Is there any way to do what I'm trying to do?

Jason Swett asked Feb 05 '11 21:02


2 Answers

The answer itself

regex = '|'.join([r'\b' + str(state) + r'\b' for state in states])

The reason behind this is that the 'r' prefix tells Python not to process backslash escapes in the string literal. If you don't put an 'r' before the string, Python will try to turn any character preceded by '\' into a special character, so that you can easily enter line breaks (\n), tabs (\t) and such.

When you write '\b', you tell Python to create a string, process its escapes, and turn '\b' into a single backspace character. When you write r'\b', Python just stores '\' followed by 'b', and this is what you want for a regex, because the regex engine is the one that should interpret \b as a word boundary. Always use 'r' for strings used as regex patterns.
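A minimal sketch of the difference, reusing the sample text from the question (the variable names are only for illustration):

```python
import re

plain = '\b'   # escape is processed: one backspace character (0x08)
raw = r'\b'    # raw literal: a backslash followed by the letter b

plain_len = len(plain)   # length of the processed literal
raw_len = len(raw)       # length of the raw literal

text = 'grand rapids, mi 49505'

# The regex engine receives \b and treats it as a word boundary.
raw_match = re.search(r'\bmi\b', text)

# The regex engine receives literal backspace characters around "mi",
# which do not occur in the text.
plain_match = re.search('\bmi\b', text)

print(plain_len, raw_len)
print(raw_match.group() if raw_match else None)
print(plain_match)
```

Here the raw pattern finds 'mi', while the non-raw pattern returns None because it looks for backspace characters.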

The 'r' notation is called a 'raw string', but that's misleading, as there is no such thing as a raw string in Python internals: the result is an ordinary string, and the prefix only changes how the literal is read. Just think of it as a way to tell Python to avoid being too smart.

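There is another notation in Python < 3.0, u'string', that tells Python to store the string as unicode. You can combine both: ur"é\n" will store "é", then '\', then 'n' as unicode, while u"é\n" will store "é" then a line break. (In Python 3 every string is already unicode, and the ur prefix is no longer accepted.)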

Some ways to improve your code:

regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)

I removed the extra []. Without the brackets this is a generator expression, so your code does not build and keep a list of the values you are generating. We can do that here because the list is passed directly to join() and never reused anywhere else.

regex = '|'.join(r'\b%s\b' % state for state in states)

This takes care of the string conversion automatically and is shorter and cleaner. When you format strings in Python, think about the % operator.

If states contains a list of state zip codes, they should be stored as strings, not as ints. In that case, you can skip the type casting and shorten it even more:

regex = r'\b%s\b' % r'\b|\b'.join(states)
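As a rough check, the per-item version and the single-join version produce the same pattern. The sketch below assumes states is a list of strings; the values in it are made up for illustration.

```python
import re

# Hypothetical sample data; the real contents of `states` are not shown in the question.
states = ['mi', 'oh', 'tx']

# Wrap every item in \b...\b, then join the pieces with '|'.
joined_each = '|'.join(r'\b%s\b' % state for state in states)

# Join the items with r'\b|\b' first, then add the outer \b once on each side.
joined_once = r'\b%s\b' % r'\b|\b'.join(states)

match = re.search(joined_once, 'grand rapids, mi 49505')
found = match.group() if match else None

print(joined_once)
print(found)
```

Both forms rely on the items being plain text. If an item could contain regex metacharacters, it would need escaping first, which this sketch does not do.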

Finally, you may not need a regex at all. If all you care about is whether one of the zip codes appears in the given string, and plain substring matching is acceptable, you can just use in (it checks whether an item is in a container, here whether one string occurs inside another):

matches = [s for s in states if s in 'grand rapids, mi 49505']
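One thing to keep in mind is that in does substring matching on strings, so it ignores word boundaries. The sketch below compares the two approaches; the states list is hypothetical, and 'ra' is included only because it occurs inside "grand" and "rapids".

```python
import re

# Hypothetical sample data; 'ra' is not a real state code.
states = ['mi', 'ra', 'oh']
text = 'grand rapids, mi 49505'

# Substring test: also matches fragments inside longer words.
substring_hits = [s for s in states if s in text]

# Word-boundary test: matches only when the item stands alone as a word.
word_hits = [s for s in states if re.search(r'\b%s\b' % s, text)]

print(substring_hits)
print(word_hits)
```

If the items are things like five-digit zip codes, the difference rarely matters; for short codes such as two-letter abbreviations, the regex version is the safer choice.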

Last word

I understand that learning a new language can be frustrating, but take the time to give your question a proper title. On this website, the title should end with a question mark and give specific details about the problem.

e-satis answered Oct 12 '22 23:10


The solution is the one you used yourself in the example above: raw strings.

regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)

(Note that I also removed the extra brackets, turning the list comprehension into a generator expression.)

Sven Marnach answered Oct 13 '22 01:10