
whitespace in regular expression

Tags: python, regex

I have a question: is \t equivalent to \s+ in a regular expression? I have these lines of code:

>>> import re
>>> b = '\tNadya Carson'
>>> c = re.compile(r'\s\s*')
>>> c
<_sre.SRE_Pattern object at 0x02729800>
>>> c.sub('',b)
'NadyaCarson'
>>> c = re.compile(r'\s\s+')
>>> c
<_sre.SRE_Pattern object at 0x027292F0>

The pattern object is created fine up to this point, but when I try to replace the whitespace with an empty string, the result still contains \t:

>>> c.sub('',b)
'\tNadya Carson'

Why is the sub method not working in this case? Thank you!

Tirtha asked Apr 22 '14 16:04


2 Answers

\t is not equivalent to \s+, but \s+ does match a tab (\t).

The problem in your example is that the second pattern \s\s+ is looking for two or more whitespace characters, and \t is only one whitespace character.

Here are some examples that should help you understand:

>>> result = re.match(r'\s\s+', '\t')
>>> print result
None
>>> result = re.match(r'\s\s+', '\t\t')
>>> print result
<_sre.SRE_Match object at 0x10ff228b8>

\s\s+ would also match ' \t', '\n\t', ' \n \t \t\n'.

Also, \s\s* is equivalent to \s+. Both will match one or more whitespace characters.
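To tie this back to the string from the question, here is a small sketch (written for Python 3, with variable names made up for illustration) that runs all three patterns against it. Note that \s+ and \s\s* also remove the space between the two names, since every run of whitespace is replaced:

```python
import re

b = '\tNadya Carson'

# \s\s* and \s+ both mean "one or more whitespace characters",
# so they remove the tab and the space between the names.
one_or_more_a = re.sub(r'\s\s*', '', b)
one_or_more_b = re.sub(r'\s+', '', b)

# \s\s+ needs at least two whitespace characters in a row.
# The string only has isolated ones, so nothing is replaced.
two_or_more = re.sub(r'\s\s+', '', b)

print(repr(one_or_more_a))  # 'NadyaCarson'
print(repr(one_or_more_b))  # 'NadyaCarson'
print(repr(two_or_more))    # '\tNadya Carson'
```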

Rob Watts answered Sep 22 '22 00:09


\s+ is not equivalent to \t because \s does not mean <space>; it means <whitespace>. A literal space (four of which are sometimes displayed in place of a tab, depending on the application) is simply ' '. That is, hitting the spacebar creates a literal space. That's hardly surprising.
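If you want to check for yourself what counts as whitespace, something like the following should work. It is only a sketch for Python 3.4+ (re.fullmatch does not exist in older versions), and the list of characters is my own selection of common ASCII whitespace, not an exhaustive one:

```python
import re

# Common ASCII whitespace characters: space, tab, newline,
# carriage return, form feed, vertical tab.
candidates = [' ', '\t', '\n', '\r', '\f', '\v']

# Map each character to whether \s matches it on its own.
matches = {repr(ch): bool(re.fullmatch(r'\s', ch)) for ch in candidates}
print(matches)

# A letter is not whitespace, so \s does not match it.
letter_matches = bool(re.fullmatch(r'\s', 'N'))
print(letter_matches)  # False
```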

\s\s will never match a single \t: \t IS whitespace, so the first \s matches it, which leaves nothing for the second \s. It will match \t\t, because that is two characters of whitespace (both tabs). When your regex runs \s\s+, it's looking for one character of whitespace followed by one, two, three, or really ANY number more, so it needs at least two in a row. When it reads your regex it does this:

\s\s+


The \t matches the first \s, but when it hits the second one, \s+, the next character is N rather than whitespace, so your regex spits it back out saying "Oh, nope, never mind."

Your first regex does this:

\s\s*


Again, the \t matches your first \s. When the regex continues, it sees that the next character doesn't match the second \s, so it takes the "high road" instead and skips it. That's why \s\s* matches: the * quantifier means "zero or more," while the + quantifier means "one or more."
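You can watch the same thing happen by looking at what re.match returns for each pattern. This is just an illustrative sketch in Python 3; the variable names and the two-tab string are mine, not from the question:

```python
import re

b = '\tNadya Carson'

# \s\s*: the tab satisfies \s, then \s* is allowed to match zero characters.
star = re.match(r'\s\s*', b)
print(star.span())  # (0, 1)

# \s\s+: the tab satisfies \s, but \s+ needs at least one more
# whitespace character and finds 'N' instead, so the match fails.
plus = re.match(r'\s\s+', b)
print(plus)  # None

# With two leading tabs, \s\s+ has enough whitespace to match.
plus_two_tabs = re.match(r'\s\s+', '\t\tNadya Carson')
print(plus_two_tabs.span())  # (0, 2)
```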

Adam Smith answered Sep 20 '22 00:09