
python "re" package, strange phenomenon with "raw" string

Tags: python, regex

I am seeing the following behavior, which I can't figure out, and I didn't find anything when searching the archives.

If I type in:

>>> if re.search(r'\n', r'this\nis\nit'):
...     print('found it!')
... else:
...     print("didn't find it")
... 

I will get:

didn't find it

However, if I type in:

>>> if re.search(r'\\n', r'this\nis\nit'):
...     print('found it!')
... else:
...     print("didn't find it")
... 

Then I will get:

found it!

(The first pattern, r'\n', has a single backslash, whereas the second, r'\\n', has two backslashes in a row.)

I can guess what is going on, but I don't understand the official mechanism behind it. In the first case, I need to escape at two levels: the regular expression and the special string sequences. "Raw" takes care of the string escapes, but not the regular expression.

But there will never be a regular expression in the second string, since it is the string being matched. So there is only a need to escape once.

However, something doesn't seem consistent to me: how am I supposed to ensure that the characters REALLY ARE taken literally in the first case? Can I type rr''? Or do I have to make sure that I escape things twice? In a similar vein, how do I ensure that a variable is taken literally (or that it is NOT taken literally)? E.g., what if I had a variable tmp = 'this\nis\nmy\nhome', and I really wanted to find the literal combination of a backslash and an 'n', instead of a newline?

Thanks!
Mike

Mike Williamson asked Jun 28 '11 00:06



3 Answers

re.search(r'\n', r'this\nis\nit')

As you said, "there will never be a regular expression in the second string." So we need to look at these strings differently: the first string is a regex, the second just a string. Usually your second string will not be raw, so any backslashes are Python-escapes, not regex-escapes.

So the first string consists of a literal "\" and an "n". This is interpreted by the regex parser as a newline (docs: "Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser"). So your regex will be searching for a newline character.

Your second string consists of the string "this" followed by a literal "\" and an "n". So this string does not contain an actual newline character. Your regex will not match.

As for your second regex:

re.search(r'\\n', r'this\nis\nit')

This version matches because your regex contains three characters: a literal "\", another literal "\" and an "n". The regex parser interprets the two backslashes as a single "\" character, followed by an "n". So your regex will be searching for a "\" followed by an "n", which is found within the string. But that isn't very helpful, since it has nothing to do with newlines.

Most likely what you want is to drop the r from the second string, thus treating it as a normal Python string.

re.search(r'\n', 'this\nis\nit')

In this case, your regex (as before) is searching for a newline character. And, it finds it, because the second string contains the word "this" followed by a newline.
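
To make the three cases above concrete, here is a small sketch written for Python 3. The variable names are only illustrative; the patterns and strings are the ones from the question.

```python
import re

raw_text = r'this\nis\nit'    # backslash + 'n' pairs, no newline characters
plain_text = 'this\nis\nit'   # contains two real newline characters

# r'\n' is two characters; the regex parser reads them as "newline".
newline_pattern = r'\n'
# r'\\n' is three characters; the regex parser reads them as
# "literal backslash, then n".
backslash_n_pattern = r'\\n'

print(len(newline_pattern), len(backslash_n_pattern))         # 2 3
print('\n' in raw_text, '\n' in plain_text)                   # False True

print(re.search(newline_pattern, raw_text) is not None)       # False
print(re.search(backslash_n_pattern, raw_text) is not None)   # True
print(re.search(newline_pattern, plain_text) is not None)     # True
```

Checking len() on the pattern and `'\n' in text` on the subject is a quick way to see what each literal really contains before the regex parser gets involved.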

mgiuca answered Nov 10 '22 20:11



Escaping special sequences in string literals is one thing; escaping regular expression special characters is another. The raw string prefix only affects the former.

Technically, re.search accepts two strings and passes the first to the regex builder with re.compile. The compiled regex object is then used to search for the pattern inside plain strings. The second string is never compiled and thus it is not subject to regex special character rules.
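
As a rough illustration of that split (a sketch, not a description of the re module's internals, which also cache compiled patterns), the one-step call and the explicit compile-then-search form give the same result. The names pattern_text and subject are made up for this example.

```python
import re

pattern_text = r'\\n'          # regex source: literal backslash followed by 'n'
subject = r'this\nis\nit'      # plain data; never parsed as a regex

# One-step form: the module-level function takes the pattern as a string.
direct = re.search(pattern_text, subject)

# Two-step form: compile the pattern first, then search the subject with it.
compiled = re.compile(pattern_text)
via_compiled = compiled.search(subject)

print(direct.span(), via_compiled.span())   # (4, 6) (4, 6)
```

Only pattern_text ever goes through the regex parser; subject is scanned as-is.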

If the regex builder receives a \n after the string literal is processed, it converts this sequence to a newline character. You have to escape the backslash again if you need to match the literal sequence instead.

The rationale behind all this is that regular expressions are not part of the language syntax. Rather, they are handled within the standard library, inside the re module, using common building blocks of the language.

The re.compile function uses special characters and escaping rules compatible with the most commonly used regex implementations. However, the Python interpreter is not aware of the whole regular expression concept, and it does not know whether a string literal will be compiled into a regex object or not. As a result, Python can't provide any kind of syntax simplification such as the ones you suggested.
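
For the variable case raised in the question, one option is re.escape from the standard library, which backslash-escapes characters that are special in a regex. This is a suggestion rather than something the answers here spell out. Note that tmp is written as a raw string in this sketch (an assumption, so that there is a backslash to find); the tmp in the question is not raw and therefore holds real newlines.

```python
import re

tmp = r'this\nis\nmy\nhome'    # assumed raw here: backslash + 'n' pairs
needle = '\\n'                 # two characters: a backslash and an 'n'

# re.escape builds a pattern that matches the needle character for character.
literal_pattern = re.escape(needle)
matches = re.findall(literal_pattern, tmp)
print(len(matches))            # 3

# For a plain literal search, no regex is needed at all.
print(tmp.count(needle))       # 3

# The non-raw string from the question holds newlines, not backslashes.
tmp_newlines = 'this\nis\nmy\nhome'
print(re.search(literal_pattern, tmp_newlines) is None)   # True
```

If the text to find comes from a variable and should never be treated as regex syntax, str methods such as count, find, or the in operator avoid the escaping question entirely.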

Tugrul Ates answered Nov 10 '22 19:11



Regexes assign their own meaning to backslashes, for example in character classes like \d. If you actually want to match a literal backslash character, you do need to escape it in the regex, even in a raw string (r'\\'). The two arguments aren't supposed to be parallel, since you're matching a regex against a string.

Raw strings are just a convenience, and it would be way overkill to have double-raw strings.
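
A short sketch of what the raw prefix saves you: without it, each backslash in the regex has to be doubled again for the Python literal. The path value is made-up sample data.

```python
import re

# Both literals produce the same two-character regex source: \\
raw_spelling = r'\\'
plain_spelling = '\\\\'
print(raw_spelling == plain_spelling)          # True

path = r'C:\temp\file1.txt'   # made-up sample data with two backslashes
print(len(re.findall(raw_spelling, path)))     # 2

# A backslash followed by a letter can be a regex class instead,
# e.g. \d for a digit.
print(re.findall(r'\d', path))                 # ['1']
```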

Mu Mind answered Nov 10 '22 18:11
