
Understanding the 're.search()' behavior in Python

Here is the Python code I used to split the letters and the digits out of an alphanumeric string:

import re

input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
print(re.search('[a-z]*', input_string).group())
print(re.search('[0-9]*', input_string).group())

In the output I get the string of letters but not the string of digits. If I modify the code as follows, the output shows the digits:

print(re.search('[0-9]*$', input_string).group())

I am used to grep, and I found that its functionality is similar to that of the re module. If I run the following command in a shell, I get the desired result:

echo "abcdefghijklmnopqrstuvwxyz1234567890" | grep "[0-9]*"

Am I missing something here?

heemayl asked Dec 25 '22 23:12


2 Answers

I suggest using re.findall (which finds every match) instead of re.search, because re.search returns only the first match.

>>> import re
>>> input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
>>> print(re.findall(r'\d+|[a-z]+', input_string))
['abcdefghijklmnopqrstuvwxyz', '1234567890']

Also, don't use [a-z]*, because it can match empty strings as well. * repeats the previous token zero or more times, whereas + repeats it one or more times.

>>> print(re.search(r'\d+', input_string).group())
1234567890
>>> print(re.search(r'[a-z]+', input_string).group())
abcdefghijklmnopqrstuvwxyz
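
To make the empty strings visible, here is a minimal sketch that runs re.findall with both quantifiers on a shortened stand-in for the input. The string 'abc123' and the variable names are only illustrative and are not part of the original question.

```python
import re

sample = 'abc123'  # shortened stand-in for the original input (assumption)

# '*' allows zero repetitions, so besides the run of letters it also
# matches the empty string at positions where no letter starts.
star_matches = re.findall(r'[a-z]*', sample)

# '+' requires at least one character, so no empty strings are returned.
plus_matches = re.findall(r'[a-z]+', sample)

print(star_matches)
print(plus_matches)
```

The first list should contain 'abc' followed by empty strings, while the second contains only 'abc'.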

Why does the first one work while the second fails?

>>> print(re.search('[a-z]*', input_string).group())
abcdefghijklmnopqrstuvwxyz
>>> print(re.search('[0-9]*', input_string).group())

>>>

* repeats the previous token zero or more times, so the pattern can also match the empty string that exists just before any character it cannot match. The first pattern, [a-z]*, returns abcdefghijklmnopqrstuvwxyz because that substring is located at the start of the input. If the input were 8abcdefghijklmnopqrstuvwxyz, it would return an empty string instead. This is because re.search stops after finding the first match: 8 is not matched by [a-z], so [a-z]* matches the empty string just before the 8.

regex = [0-9]*, string = "abcdefghijklmnopqrstuvwxyz1234567890"

re.search stops after finding the first match. Here a is not matched by [0-9], but [0-9]* matches the empty string just before the a, because * allows zero repetitions. That is why you got an empty string as output in the second case.
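
One way to check where the match actually happens is to look at the match object's span() instead of only its group(). The following is a small sketch using the input from the question; the variable names star and plus are just illustrative.

```python
import re

input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'

# With '*', the search succeeds immediately at index 0 with a
# zero-length match, so group() is the empty string.
star = re.search(r'[0-9]*', input_string)
print(star.span(), repr(star.group()))   # (0, 0) ''

# With '+', a zero-length match is not allowed, so the engine keeps
# scanning until it reaches the first digit.
plus = re.search(r'[0-9]+', input_string)
print(plus.span(), repr(plus.group()))   # (26, 36) '1234567890'
```

The (0, 0) span shows that the search did not fail; it succeeded on an empty match before the first letter.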

>>> print(re.search('[0-9]*$', input_string).group())
1234567890

Since we added an end-of-line anchor, the pattern now looks for zero or more digits at the end of the line. If the line does not end in digits, it returns an empty string as the match.

>>> print(re.search('[0-9]*$', '12foo').group())

>>> 
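
As a rough illustration of the anchored pattern, this sketch records the span of the match for two inputs: one that ends in digits and one that does not. The dictionary name spans and the test string 'abc123' are assumptions made for the example.

```python
import re

# Map each test string to the (start, end) span of the anchored match.
spans = {}
for text in ('abc123', '12foo'):
    match = re.search(r'[0-9]*$', text)
    spans[text] = match.span()
    print(text, match.span(), repr(match.group()))
```

For 'abc123' the match covers the trailing digits, while for '12foo' it is a zero-length match at the very end of the string; the leading 12 is skipped because it is not followed by the end of the line.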
Avinash Raj answered Jan 06 '23 10:01


"In output i am getting the string of letters but not getting the string of digits."

I just checked both Ruby and Perl as well, and they produce the same results.

The digit pattern matches:

  1. The zero-width spot before the first character.
  2. The zero-width spot that is between the first character and the second character.
  3. etc.
  4. The sequence of numbers at the end of the string.

However, re.search() only returns the first match.

The lower case letter pattern matches:

  1. The sequence of letters at the beginning of the string.
  2. The zero-width spot between the last letter and the 1.
  3. The zero-width spot between the 1 and 2.
  4. The zero-width spot between the 2 and 3.
  5. etc.
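
The two lists above can be reproduced with re.finditer, which yields every match together with its position. This is only a sketch: list_matches is a hypothetical helper written for this example, and 'abc123' is a shortened stand-in for the original input.

```python
import re

sample = 'abc123'  # shortened stand-in for the original input (assumption)

def list_matches(pattern, text):
    """Return (start, end, matched_text) for every match, left to right."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

digit_matches = list_matches(r'[0-9]*', sample)
letter_matches = list_matches(r'[a-z]*', sample)

print('digit pattern:')
for start, end, found in digit_matches:
    print(' ', start, end, repr(found))

print('letter pattern:')
for start, end, found in letter_matches:
    print(' ', start, end, repr(found))

# re.search() reports only the first entry of each list.
first_digit = re.search(r'[0-9]*', sample)
print(first_digit.span() == digit_matches[0][:2])
```

The first entry for the digit pattern is a zero-width match at index 0, which is exactly what re.search() returns.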

"if i run the following command in shell i get the desired result:"

echo "abcdefghijklmnopqrstuvwxyz1234567890" | grep "[0-9]*"

In a bash shell, I get:

$ echo "abcdefghijklmnopqrstuvwxyz1234567890" | grep "[0-9]*"
abcdefghijk

And I get similar strange results with echo, grep, and other patterns.

Response to comment:

$ bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin10.0)
Copyright (C) 2007 Free Software Foundation, Inc.

$ echo "abc123" | grep -o "[a-z]*"
abc
$ echo "abc123" | grep -o "[0-9]*"
$ echo "abc123" | grep -o "[0-9]*$"
123
$ 
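
For comparison on the Python side, GNU grep documents -o as printing only the non-empty matched parts of a line. A rough Python analogue of that idea is to collect every match and drop the empty ones. This is only an approximation and not a reimplementation of grep (whose output can differ between versions, as the session above shows); only_matching is a hypothetical helper name.

```python
import re

def only_matching(pattern, line):
    """Hypothetical helper: return all non-empty matches of pattern in line."""
    return [m.group() for m in re.finditer(pattern, line) if m.group()]

print(only_matching(r'[a-z]*', 'abc123'))   # ['abc']
print(only_matching(r'[0-9]*', 'abc123'))   # ['123']
print(only_matching(r'[0-9]*$', 'abc123'))  # ['123']
```

Filtering out the empty matches is what lets the unanchored [0-9]* pattern reach the digits here, whereas re.search() stops at the first (empty) match.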
7stud answered Jan 06 '23 10:01