
Python regex -- extraneous matchings

Tags: python, regex

I want to split a string using -, +=, ==, =, +, and whitespace as delimiters. I want to keep each delimiter in the result unless it is whitespace.

I've tried to achieve this with the following code:

def tokenize(s):
  import re
  pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
  return pattern.split(s)

print(tokenize("hello-+==== =+  there"))

I expected the output to be

['hello', '-', '+=', '==', '=', '=', '+', 'there']

However I got

['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']

This is almost what I wanted, except for the extraneous None values and empty strings.

Why is it behaving this way, and how might I change it to get what I want?

math4tots asked May 10 '13 18:05


3 Answers

By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use):

['hello', '', '', '', '', '', '', '', 'there']

Note the empty strings in between - and +=, += and ==, etc.

As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the bits that the capture group matches are interspersed:

['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']

None corresponds to where the \s+ half of your pattern was matched; in those cases, the capture group captured nothing.
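
To see the two behaviours side by side, here is a small sketch that splits the question's string once with a non-capturing group and once with the original capturing group. The variable names plain and captured are only for illustration:

```python
import re

s = "hello-+==== =+  there"

# Non-capturing group: only the text between matches is returned.
plain = re.split(r"(?:\-|\+\=|\=\=|\=|\+)|\s+", s)

# Capturing group: after each match, the captured delimiter is inserted,
# or None when the \s+ branch matched and the group captured nothing.
captured = re.split(r"(\-|\+\=|\=\=|\=|\+)|\s+", s)

print(plain)
print(captured)
```

The second list is the first one with one extra entry (a delimiter or None) slotted in after every match.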

From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard Nones and empty strings:

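import re

def tokenize(s):
  # Raw string so the backslashes reach the regex engine unchanged.
  pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
  # Drop the None and '' entries that split leaves between adjacent delimiters.
  return [x for x in pattern.split(s) if x]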

One last note: For what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options here.

Josh Kelley answered Nov 08 '22 02:11


Why is it behaving this way?

According to the documentation for re.split:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

This is literally correct: if capturing parentheses are used, then the text of all groups is returned, whether or not they matched anything; a group that did not take part in the match is returned as None.

As always with split, two consecutive delimiters are considered to separate empty strings, so you get empty strings interspersed.
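
A minimal illustration of that last point, using a made-up comma-separated string rather than the one from the question:

```python
import re

# Two adjacent commas delimit an empty string, for str.split and re.split alike.
with_str_split = "a,,b".split(",")
with_re_split = re.split(",", "a,,b")

print(with_str_split)
print(with_re_split)
```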

How might I change it to get what I want?

The simplest solution is to filter the output:

list(filter(None, pattern.split(s)))

rici answered Nov 08 '22 00:11


Perhaps re.findall would be more suitable for you?

>>> re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', "hello-+==== =+  there")
['hello', '-', '+=', '==', '=', '=', '+', 'there']
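
One thing to watch with this approach: alternatives are tried left to right, so the two-character operators have to come before their one-character prefixes. The sketch below compares the pattern above with a reordered variant (the names good and bad are just for illustration):

```python
import re

s = "hello-+==== =+  there"

# Longer operators listed first, as in the pattern above.
good = re.findall(r"-|\+=|==|=|\+|[^-+=\s]+", s)

# Shorter operators listed first: "+" and "=" always win,
# so the "+=" and "==" alternatives are never reached.
bad = re.findall(r"-|\+|=|\+=|==|[^-+=\s]+", s)

print(good)
print(bad)
```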

Janne Karila answered Nov 08 '22 01:11