Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the Python way of doing a \G anchored parsing loop?

Tags:

python

regex

What follows is a perl function I wrote years ago. It is a smart tokenizer that recognizes some instances of things being stuck together that maybe shouldn't be. For example, given the input on the left, it divides the string as shown on the right:

abc123  -> abc|123
abcABC  -> abc|ABC
ABC123  -> ABC|123
123abc  -> 123|abc
123ABC  -> 123|ABC
AbcDef  -> Abc|Def    (e.g. CamelCase)
ABCDef  -> ABC|Def    
1stabc  -> 1st|abc    (recognize valid ordinals)
1ndabc  -> 1|ndabc    (but not invalid ordinals)
11thabc -> 11th|abc   (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc

I'm now doing some machine learning experimentation, and I'd like to do some experiments that use this tokenizer. But first, I will need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, something which I hear does not exist in python. I've tried googling for how this is done in Python, but I am not sure what exactly to search for, so I'm having trouble finding an answer.

How would you write this function in Python?

sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters, 
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;

    while ( $value ne '' && $value =~ m/
        \G                # start where previous left off
        ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character 
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1(?i:st))         # first
        |   (?<=[^1][1](?i:st))    # first but not 11th
        |   (?<=^2(?i:nd))         # second
        |   (?<=[^1]2(?i:nd))      # second but not 12th
        |   (?<=^3(?i:rd))         # third
        |   (?<=[^1]3(?i:rd))      # third but not 13th
        |   (?<=1[123](?i:th))     # 11th - 13th
        |   (?<=[04-9](?i:th))     # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not first
        |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not first
        |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}

I did try using re.split() with a variation on the above. However, split() refuses to split on a zero-width match (an ability that should be possible if one really knows what one is doing).

I did come up with a solution to this specific problem, but not to the general problem of "how do I use \G based parsing" - I have some sample code that does regexes within loops that are anchored using \G and then in the body it uses another match anchored at \G to see which way to proceed with the parse. So I'm still looking for an answer.

That said, here is my final working code for translating the above to Python:

import re

IsA                 = lambda s: '['  + s + ']'
IsNotA              = lambda s: '[^' + s + ']'

Upper               = IsA( 'A-Z' )
Lower               = IsA( 'a-z' )
Letter              = IsA( 'a-zA-Z' )
Digit               = IsA( '0-9' )
AlphaNumeric        = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric     = IsNotA( 'a-zA-Z0-9' ) 

EndOfString         = '$'
OR                  = '|'

ZeroOrMore          = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore           = lambda s: s + '+'
OneOrMoreNonGreedy  = lambda s: s + '+?'

StartsWith          = lambda s: '^' + s
Capture             = lambda s: '('    + s + ')'
PreceededBy         = lambda s: '(?<=' + s + ')'
FollowedBy          = lambda s: '(?='  + s + ')'
NotFollowedBy       = lambda s: '(?!'  + s + ')'
StopWhen            = lambda s: s
CaseInsensitive     = lambda s: '(?i:' + s + ')'

ST                  = '(?:st|ST)'
ND                  = '(?:nd|ND)'
RD                  = '(?:rd|RD)'
TH                  = '(?:th|TH)'

def OneOf( *args ):
  return '(?:' + '|'.join( args ) + ')'

pattern = '(.+?)' + \
  OneOf( 
    # ABC | !!! - break at whitespace or non-alpha-numeric boundary
    PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
    PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

    # ABC | Abc - break at what looks like the start of a word or sentence
    FollowedBy( Upper + Lower ),

    # abc | ABC - break when a lower-case letter is followed by an upper case
    PreceededBy( Lower )  + FollowedBy( Upper ),

    # abc | 123 - break between words and digits
    PreceededBy( Letter ) + FollowedBy( Digit ),

    # 1st | oak - recognize when the string starts with an ordinal
    PreceededBy( StartsWith( '1' + ST ) ),
    PreceededBy( StartsWith( '2' + ND ) ),
    PreceededBy( StartsWith( '3' + RD ) ),

    # 1st | abc - contains an ordinal
    PreceededBy( IsNotA( '1' ) + '1' + ST ),
    PreceededBy( IsNotA( '1' ) + '2' + ND ),
    PreceededBy( IsNotA( '1' ) + '3' + RD ),
    PreceededBy( '1' + IsA( '123' )  + TH ),
    PreceededBy( IsA( '04-9' )       + TH ),

    # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
    PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( '1' + IsA( '123' ) )  + FollowedBy( Letter ) + NotFollowedBy( TH ),
    PreceededBy( IsA( '04-9' ) )       + FollowedBy( Letter ) + NotFollowedBy( TH ),

    # abcde | $ - end of the string
    FollowedBy( EndOfString )
  )

matcher = re.compile( pattern )

def tokenize( s ):
  return matcher.findall( s )
like image 383
John Arrowwood Avatar asked Dec 06 '15 19:12

John Arrowwood


1 Answers

Emulate \G at beginning of a regex with re.RegexObject.match

You can emulate the effect of \G at the beginning of a regex with re module by keeping track of and providing the starting position to re.RegexObject.match, which forces the match to begin at the specified position in pos.

def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although index != m.end() check zero-length match, it's more of
    # a guard against accidental infinite loop.
    # Don't expect a regex which can match empty string to work.
    # See Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o

Caveat

A caveat to this method is that it doesn't play well with regex which matches empty string in the main match, since Python doesn't have any facility to force the regex to retry the match while preventing zero-length match.

As an example, re.findall(r'(.??)', 'abc') returns an array of 4 empty strings ['', '', '', ''], whereas in PCRE, you can find 7 matches ['', 'a', '', 'b', '', 'c' ''] where the 2nd, 4th, and 6th matches start at the same indices as the 1st, 3rd and 5th matches respectively. The additional matches in PCRE are found by retrying at the same indices with a flag which prevents empty string match.

I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code couldn't have worked.

Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) to (.+?), as done in the question, avoids this issue, though you might want to use re.S flag.

Other comments on the regex

Since case-insensitive flag in Python affects the whole pattern, the case insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if it's part of your requirement.

Since Python supports the free-spacing mode with re.X flag, you can write your regex similar to what you did in Perl code:

matcher = re.compile(r'''
    (.+?)
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character 
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not first
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not first
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)
like image 143
nhahtdh Avatar answered Nov 03 '22 09:11

nhahtdh