How to make a group for each word in a sentence?

Question

This may be a silly question but...

Say you have a sentence like:

The quick brown fox

Or you might get a sentence like:

The quick brown fox jumped over the lazy dog

The simple regexp (\w*) finds the first word "The" and puts it in a group.

For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.

Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.

I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.

Thanks for any insight you may have.

Edit: For completeness's sake here's my actual use case and how I solved it using your help. Thanks again.

>>> import re
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)
>>> print(s)
5 test1 5 test2 5 test3 5 test4 5 test5
>>> names = re.findall(r'\d+\s(\w+)', s)
>>> print(names)
['test1', 'test2', 'test3', 'test4', 'test5']

razpeitia · Accepted Answer

You can use re.findall, which returns every non-overlapping match of the pattern as a list:

>>> import re
>>> re.findall(r"\w+", "The quick brown fox")
['The', 'quick', 'brown', 'fox']
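
If you also need to know where each word sits in the sentence, re.finditer yields match objects instead of plain strings. This is only a sketch of the "loop over all the matches" idea from the question; the helper name words_with_spans is made up for illustration.

```python
import re

def words_with_spans(sentence):
    """Return (word, start, end) for each run of word characters."""
    # finditer yields one match object per non-overlapping match, left to right
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", sentence)]

print(words_with_spans("The quick brown fox"))
# [('The', 0, 3), ('quick', 4, 9), ('brown', 10, 15), ('fox', 16, 19)]
```

Each match object is independent, so this works for any number of words without adding groups to the pattern.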

markets · Answer

I don't believe that it is possible. A regex has a fixed number of capture groups, one per pair of parentheses in the pattern. If you repeat a single group, as in '((\w+)\s+){0,99}', each repetition writes to the same first and second groups, and Python's re keeps only the last capture for each group. It does not create new groups for each match found.
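
A quick way to see this for yourself. The pattern here is a variant of the one in the question (using \w+ rather than \w*, which is my assumption, so that each repetition has to consume a word), so treat it as a sketch rather than the exact expression the asker tried.

```python
import re

sentence = "The quick brown fox"

# A repeated group is still a single group: every repetition
# overwrites the previous capture, so only the last word survives.
m = re.match(r"(?:(\w+)\s*)*", sentence)
last_only = m.groups()

# findall re-applies the pattern after each match, so every word is returned.
all_words = re.findall(r"\w+", sentence)

print(last_only)   # ('fox',)
print(all_words)   # ['The', 'quick', 'brown', 'fox']
```

The whole sentence is matched by the repeated pattern, but group 1 only remembers its final repetition.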

You could use str.split, but when you pass it a separator it splits only on that exact string, not on a class of characters like whitespace. (Called with no arguments, str.split() does split on runs of whitespace.)

If you need a pattern as the separator, you can use re.split, which splits on a regular expression, and give it '\s' to match any whitespace. You probably want '\s+' so that a run of whitespace counts as a single separator.

>>> import re
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> re.split(r'\s+', 'The   quick brown\t fox')
['The', 'quick', 'brown', 'fox']
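
One difference worth knowing about, shown as a small sketch with a made-up input string: when the text starts or ends with whitespace, re.split returns empty strings at the edges, while str.split() with no arguments drops them.

```python
import re

text = "  The   quick brown\t fox "

# str.split() with no arguments splits on runs of whitespace
# and ignores leading and trailing whitespace.
plain = text.split()

# re.split keeps an empty string wherever the separator touches an edge.
with_edges = re.split(r"\s+", text)

# Stripping first gives the same result as str.split().
stripped = re.split(r"\s+", text.strip())

print(plain)       # ['The', 'quick', 'brown', 'fox']
print(with_edges)  # ['', 'The', 'quick', 'brown', 'fox', '']
print(stripped)    # ['The', 'quick', 'brown', 'fox']
```

For plain whitespace, str.split() is usually enough; re.split is more useful when the separator really is a pattern.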

Tags:

python

regex

regex-group

blah238
