
Getting the token sequence number after regex matching in python

I want to find all the elements in a list that match a regex. To decrease the number of times regex matching is done, I created a string by joining the elements delimited with a space, as given below:

import re

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23"]
regex_num = r"\d+"
string_a = " ".join(list_a)
num_matches = re.findall(regex_num, string_a)

The list and the matches are as given below:

list_a:  ['4123', '7648', 'afjsdn', 'ujaf', 'huh23']
matches:  ['4123', '7648', '23']

Now that I have all my matches, I want to know whether each match was only part of an element/token or the entire token. One way I can do this is by comparing the match with the actual token/element:

"23" == "huh23"
False

But to do this, I would need the token's serial number (its index in the list), which isn't available directly. The only position information regex matching provides is the span of the match, which is at the character level.

The other path I could take is to loop through the list, apply the regex to each element, and compare the element with the match whenever there is one.

I would like to reduce as much time complexity as possible for this operation.

Is there a more Pythonic way to determine whether a match is only part of a token, or to find the serial number of the matched token so that the initial list can be used for the string comparison?

Any help would be appreciated. Thanks in advance!

Edit 1:

If my list is something like the following (as suggested by @Artyom Vancyan in the comments):

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23", "n23kl3l24"]

The output I would like is:

matches_with_slno = [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
Suneha K S asked Mar 07 '26 18:03


1 Answer

Using yield from

The most Pythonic solution I would recommend combines enumerate with a generator.

import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']


def process(array):
    for index, item in enumerate(array):
        yield from [[index, match] for match in re.findall(r"\d+", item)]


print(list(process(arr)))  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]

One use of yield from is flattening nested iterables: here it yields each [index, match] pair from the inner list in turn. yield from cannot be used inside a list comprehension; otherwise, this could be a one-liner. enumerate supplies each element's serial number (index). Because the function uses yield, process is a generator function.

NOTE: The generator implementation still uses both a loop and a list comprehension.
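The question also asks whether each match is the whole token or only part of it. A minimal sketch of one way to extend the generator idea, assuming re.finditer is used in place of re.findall so that each match's span is available; the names DIGITS and matches_with_flag are hypothetical and chosen only for illustration:

```python
import re

DIGITS = re.compile(r"\d+")


def matches_with_flag(tokens):
    """Yield (index, match_text, is_whole_token) for every digit run in every token."""
    for index, token in enumerate(tokens):
        for match in DIGITS.finditer(token):
            # The match is the entire token only if its span covers the whole string.
            is_whole = match.span() == (0, len(token))
            yield index, match.group(), is_whole


arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']
result = list(matches_with_flag(arr))
print(result)
```

Comparing the span with (0, len(token)) gives the same answer as the "23" == "huh23" comparison in the question, without needing a second lookup into the list.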

Using list comprehension

import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']

print([[index, match] for index, item in enumerate(arr) for match in re.findall(r"\d+", item)])  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
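If the single joined string from the question is preferred, the character-level span of each match can be mapped back to a token index. The following is only a sketch under two assumptions: the tokens are joined with a single space, and the pattern can never match across that space (true for \d+). The names starts and offset are hypothetical helpers, not part of the original code:

```python
import re
from bisect import bisect_right

tokens = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']
joined = " ".join(tokens)

# starts[i] is the character offset at which tokens[i] begins in `joined`.
starts = []
offset = 0
for token in tokens:
    starts.append(offset)
    offset += len(token) + 1  # +1 for the single-space delimiter

matches_with_slno = []
for match in re.finditer(r"\d+", joined):
    # The rightmost token start that is <= match.start() identifies the token.
    index = bisect_right(starts, match.start()) - 1
    matches_with_slno.append([index, match.group()])

print(matches_with_slno)  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
```

This runs the regex once over the joined string and pays one binary search per match. Whether that is actually faster than the per-token loop depends on the data, so it is worth timing both before choosing.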
Artyom Vancyan answered Mar 09 '26 06:03


