Getting the token sequence number after regex matching in python

Question

I want to find all the elements in a list that match a regex. To decrease the number of times regex matching is done, I created a string by joining the elements delimited with a space, as given below:

import re

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23"]
regex_num = r"\d+"
string_a = " ".join(list_a)  # "4123 7648 afjsdn ujaf huh23"
num_matches = re.findall(regex_num, string_a)

The list and the matches are as given below:

list_a:  ['4123', '7648', 'afjsdn', 'ujaf', 'huh23']
matches:  ['4123', '7648', '23']

Now that I have all my matches, I want to know whether each match is only part of an element/token or an entire token. One way to do this is to compare the match with the actual token/element:

"23" == "huh23"
False

But to do this, I would need the token's serial number (its index in the list), which isn't available directly. The only position information regex matching provides is the span of the match, which is at the character level.
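For illustration, a character-level span can be mapped back to a token index by recording where each token starts in the joined string and using a binary search. This is only a sketch under the assumption that the delimiter is a single space and that the pattern can never match across a space (true for \d+). The names starts and token_index are hypothetical helpers, not part of the re module.

```python
import re
from bisect import bisect_right
from itertools import accumulate

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23"]
string_a = " ".join(list_a)

# Character offset at which each token starts inside string_a.
# Assumption: every token is followed by exactly one space delimiter.
starts = [0] + list(accumulate(len(tok) + 1 for tok in list_a))[:-1]


def token_index(char_pos):
    """Return the index of the token that contains character position char_pos."""
    return bisect_right(starts, char_pos) - 1


matches_with_slno = [
    [token_index(m.start()), m.group()]
    for m in re.finditer(r"\d+", string_a)
]
print(matches_with_slno)  # [[0, '4123'], [1, '7648'], [4, '23']]
```

Each lookup costs O(log n) in the number of tokens, on top of the single regex pass over the joined string.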

The other path I could take is to loop through the list, apply the regex to each element, and compare the element with the match whenever there is one.

I would like to keep the time complexity of this operation as low as possible.

Is there a more Pythonic way to determine whether a match is only part of a token, or to find the serial number of the matched token so that the original list can be used for the string comparison?

Any help would be appreciated. Thanks in advance!

Edit 1:

If my list is something like the following, as suggested by @Artyom Vancyan in the comments:

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23", "n23kl3l24"]

The output I would like is:

matches_with_slno = [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]

Artyom Vancyan · Accepted Answer

Using `yield from`

The most Pythonic solution I would recommend combines enumerate with a generator.

import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']


def process(array):
    for index, item in enumerate(array):
        yield from [[index, match] for match in re.findall(r"\d+", item)]


print(list(process(arr)))  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]

One use of yield from is flattening nested iterables: it yields every item of the inner list in turn. yield from cannot be used inside a list comprehension; otherwise, this would be a one-liner. enumerate supplies each element's serial index (number). Because the function body contains yield, process is a generator function.

NOTE: In the generator implementation, we use a loop and list comprehension as well.
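To make the flattening explicit, here is a sketch of the same generator with yield from written out as an inner loop. The names DIGITS and process_nested_loop are hypothetical and chosen only for this illustration. Precompiling the pattern is an optional tweak, not something the approach requires.

```python
import re

# Compiled once and reused for every element (optional).
DIGITS = re.compile(r"\d+")


def process_nested_loop(array):
    # Produces the same pairs as `process`; the inner loop yields one
    # [index, match] pair at a time instead of delegating with `yield from`.
    for index, item in enumerate(array):
        for match in DIGITS.findall(item):
            yield [index, match]


arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']
result = list(process_nested_loop(arr))
print(result)  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
```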

Using list comprehension

import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']

print([[index, match] for index, item in enumerate(arr) for match in re.findall(r"\d+", item)])  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
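Since the question also asks whether a match is an entire token or only part of one, the comprehension could be extended with a third field. The following is a sketch only: the name matches_with_flag and the three-element output shape are assumptions, not something the question specifies.

```python
import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']

# Each entry is [index, match, is_whole_token]; the flag is True only when
# the matched text is identical to the element it was found in.
matches_with_flag = [
    [index, m.group(), m.group() == item]
    for index, item in enumerate(arr)
    for m in re.finditer(r"\d+", item)
]
print(matches_with_flag)
# [[0, '4123', True], [1, '7648', True], [4, '23', False],
#  [5, '23', False], [5, '3', False], [5, '24', False]]
```

If only whole-token matches are of interest, re.fullmatch(r"\d+", item) answers that directly for each element.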

Tags:

python

python-re

Suneha K S
