
Regular expression result

Tags:

python

regex

I have the code below:

import re

line = "78349999234"

# Search for zero or more consecutive 9s anywhere in the line.
searchObj = re.search(r'9*', line)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")

However, the output is empty. I thought * means: "Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's." Why don't I see any result in this case?

user3369157 asked Oct 14 '14 23:10




2 Answers

The regular expression engine scans the string from left to right and returns the first match it finds. Since 9* also accepts zero repetitions, the first match is the empty string before the 7. When the engine does reach a 9, it matches greedily and tries to "eat" (that's the correct terminology) as many characters as possible.

If you query for:

>>> print(re.findall(r'9*', line))
['', '', '', '', '9999', '', '', '', '']

It matches the empty string at every position where no run of 9s begins and, as you can see, 9999 is matched as well.
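To see where each of those matches sits, here is a minimal sketch using re.finditer, assuming Python 3. The name matches is only illustrative. It records the start, the end and the text of every match of 9* in the same string:

```python
import re

line = "78349999234"

# Collect (start, end, text) for every match of 9* in the string.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r'9*', line)]

for start, end, text in matches:
    # Empty matches have start == end; the run of 9s spans 4 to 8.
    print(start, end, repr(text))
```

The first entry is the empty match at position 0, which is the one re.search returns.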

The main reason is probably performance: if you search for a pattern in a string of 10M+ characters, you'd be very happy if a match already turns up within the first 10k characters. You don't want to waste effort on finding the "nicest" match.


EDIT

"0 or more occurrences" means that the preceding element (in this case 9) is repeated zero or more times. In an empty string, the character is repeated exactly 0 times. If you want to match only where the character is repeated one or more times, you should use

9+

This results in:

>>> print(re.search(r'9+', line))
<_sre.SRE_Match object; span=(4, 8), match='9999'>

Using re.search with a pattern that accepts the empty string is probably not very helpful: it always succeeds at the very start of the string, and returns an empty match there whenever the string does not begin with the pattern.
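Applied to the program from the question, the change might look like the sketch below. It assumes Python 3, and search_obj is just a renamed version of the question's variable:

```python
import re

line = "78349999234"

# 9+ requires at least one 9, so the empty string can no longer match.
search_obj = re.search(r'9+', line)

if search_obj:
    print("search_obj.group() :", search_obj.group())
else:
    print("Nothing found!!")
```

If the line contains no 9 at all, re.search returns None and the else branch runs.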

Willem Van Onsem answered Sep 27 '22 18:09


The main reason is that re.search stops searching once it finds a match. 9* means "match the digit 9 zero or more times". Because an empty string exists before each and every character, re.search stops after finding the first empty string, the one before the 7. That's why you got an empty string as output.
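A small sketch, assuming Python 3, makes the dependence on the starting position visible. The start index 4 is picked by hand for this particular string, and the names first and from_four are illustrative:

```python
import re

line = "78349999234"

# Scanning from the beginning: the empty match at position 0 wins.
first = re.search(r'9*', line)
print(first.span(), repr(first.group()))

# A compiled pattern's search() accepts a start position. Starting at
# index 4, where the run of 9s begins, the greedy match is '9999'.
from_four = re.compile(r'9*').search(line, 4)
print(from_four.span(), repr(from_four.group()))
```

This only illustrates the behaviour. In practice, a pattern that cannot match the empty string, such as 9+, is the simpler fix.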

Avinash Raj answered Sep 27 '22 17:09