
How can I split at word boundaries with regexes?

Tags: python, regex, nlp

I'm trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is:

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

oarfish asked May 15 '16 11:05




2 Answers

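Unfortunately, before Python 3.7, re.split cannot split on empty (zero-length) matches such as \b. From Python 3.7 onward it can, but the result then also contains the whitespace pieces and a leading empty string: ['', 'How', ' ', 'are', ' ', 'you', '?'].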

To get around this, you would need to use findall instead of split.

Actually \b just means word boundary.

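It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that \b also matches at the start or end of the string when a word character is adjacent. A form that covers those edges as well is (?<=\w)(?!\w)|(?<!\w)(?=\w).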

That means the following code would work, although the result also contains the whitespace between the words:

import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
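
As a rough illustration of the points above, the sketch below lists the positions where \b matches, checks them against a lookaround form that also covers the string edges, and then strips the whitespace from what findall returns. The variable names and the stripping step are assumptions added for illustration, not part of the original answer.

```python
import re

sentence = "How are you?"

# Positions where \b matches; the matches are zero-length, so start == end.
boundaries = [m.start() for m in re.finditer(r"\b", sentence)]

# The same positions expressed with lookarounds that also handle string edges.
lookaround = r"(?<=\w)(?!\w)|(?<!\w)(?=\w)"
same_positions = [m.start() for m in re.finditer(lookaround, sentence)]

# \w+|\W+ cuts the string at exactly those boundaries and keeps every piece.
pieces = re.findall(r"\w+|\W+", sentence)

# Illustrative post-processing: strip whitespace and drop pieces that become empty.
tokens = [p.strip() for p in pieces if p.strip()]

print(boundaries)      # [0, 3, 4, 7, 8, 11]
print(same_positions)  # [0, 3, 4, 7, 8, 11]
print(pieces)          # ['How', ' ', 'are', ' ', 'you', '?']
print(tokens)          # ['How', 'are', 'you', '?']
```

Note that adjacent punctuation stays together with this approach: a run such as "?!" comes back as one piece, since \W+ matches the whole run.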
Kenny Lau answered Oct 20 '22 01:10


import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)

Output:

['How', 'are', 'you', '?']



Regex Explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \w match any word character [a-zA-Z0-9_] (in Python 3 str patterns, Unicode letters and digits as well, unless re.ASCII is set)
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
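
To see how the two alternatives behave on other input, here is a small sketch that wraps the same pattern in a helper. The names TOKEN_PATTERN and tokenize are hypothetical and only used for illustration.

```python
import re

# Pattern from the answer: a run of word characters and apostrophes,
# or a single character from the listed punctuation marks.
TOKEN_PATTERN = re.compile(r"[\w']+|[.,!?;]")

def tokenize(text):
    """Return words and the listed punctuation marks; anything else is skipped."""
    return TOKEN_PATTERN.findall(text)

simple = tokenize("How are you?")
contractions = tokenize("Don't panic; it's fine!")
unlisted = tokenize("a - b")  # the dash is not in the punctuation class

print(simple)        # ['How', 'are', 'you', '?']
print(contractions)  # ["Don't", 'panic', ';', "it's", 'fine', '!']
print(unlisted)      # ['a', 'b']
```

Because findall only returns what the pattern matches, whitespace and any punctuation outside [.,!?;] are silently dropped. Extend the character class if other marks should be kept.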
Pedro Lobito answered Oct 20 '22 02:10