Extract matching substrings from a list to new list in Python

I have a text file that looks like this:

garbage
moregarbaged89849843
MDeduri09ri44830
Some short sentence
Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)
Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)
Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)
77EB0A2C7C42EDC27A3D26E72A02BB29:01002d737832455680cffbadf1092baf status 'garbage'
blah blah:0101bfa0ab9641a0b863ef76519a48d3 has status 'usable'
77EB0A2C7C42EDC27A3D26E72A02BB29:blah blah

I only care about the key and KID parts, and want to extract them into separate lists.

My regex for that is key: (\w|\d){30,} and KID=(\w|\d){30,} respectively.

Code I'm using is

matchkid = re.compile(r'KID=(\w|\d){30,}')
matchkey = re.compile(r'key: (\w|\d){30,}')

filteredkids = [a for a in lis if matchkid.search(a)]
filteredkeys = [b for b in lis if matchkey.search(b)]

print(filteredkids)
print('\n')
print(filteredkeys)

where lis is a list made from the lines of the text document.

The output is

['Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)', 'Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)', 'Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)']


['Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)', 'Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)', 'Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)']

This is wrong: both lists contain the whole lines. The desired output is

['KID=01002d737832455680cffbadf1092baf', 'KID=0101bfa0ab9641a0b863ef76519a48d3', 'KID=0102900a2bc54111833631ea7bb855ed']

['key: d11001bfa937eee2f84f55a11b207356', 'key: a0ee2d0f8272355f750c5434db85291a', 'key: fe216ba17e5af807ce5af8e43cf3c031']

I have tried tweaking my regex and looking at other similar questions, but nothing seems to work and most of the time I just get empty lists.

Hoping to find some guidance here. Thanks in advance.

qwerty asked Feb 11 '26 10:02


1 Answer

The (\w|\d){30,} pattern is not a good choice: it creates a repeated capturing group, and the alternation is redundant, since \w already matches digits. \w{30,} is enough.
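To see why the repeated capturing group matters, here is a small sketch. The sample line is taken from the question; the variable names are only illustrative. When a pattern contains a capturing group, re.findall returns the group's content instead of the whole match, and a repeated group keeps only its last iteration:

```python
import re

line = ("Whatever ... key: d11001bfa937eee2f84f55a11b207356 "
        "(KID=01002d737832455680cffbadf1092baf)")

# Repeated capturing group: findall returns only the last character captured.
with_group = re.findall(r'KID=(\w|\d){30,}', line)

# No capturing group: findall returns the whole match.
without_group = re.findall(r'KID=\w{30,}', line)

# A non-capturing group (?:...) also returns the whole match,
# although the \w|\d alternation is still redundant.
non_capturing = re.findall(r'KID=(?:\w|\d){30,}', line)

print(with_group)     # ['f']
print(without_group)  # ['KID=01002d737832455680cffbadf1092baf']
print(non_capturing)  # ['KID=01002d737832455680cffbadf1092baf']
```

So simply swapping re.search for re.findall while keeping the original pattern would still not give the desired substrings.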

Next, you use search() only as the filter condition of the list comprehension. It returns a match object (or None), so the comprehension keeps the whole line whenever the pattern is found somewhere in it, while you need to grab the matched text itself.

You can fix the code by running re.findall on the whole file contents (text below is the file read into a single string):

filteredkids = re.findall(r'KID=\w{30,}', text)
filteredkeys = re.findall(r'key: \w{30,}', text)

See the Python demo:

import re
text = """garbage
moregarbaged89849843
MDeduri09ri44830
Some short sentence
Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)
Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)
Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)
77EB0A2C7C42EDC27A3D26E72A02BB29:01002d737832455680cffbadf1092baf status 'garbage'
blah blah:0101bfa0ab9641a0b863ef76519a48d3 has status 'usable'
77EB0A2C7C42EDC27A3D26E72A02BB29:blah blah"""
filteredkids = re.findall(r'KID=\w{30,}', text)
filteredkeys = re.findall(r'key: \w{30,}', text)
print(filteredkids)
print(filteredkeys)

Output:

['KID=01002d737832455680cffbadf1092baf', 'KID=0101bfa0ab9641a0b863ef76519a48d3', 'KID=0102900a2bc54111833631ea7bb855ed']
['key: d11001bfa937eee2f84f55a11b207356', 'key: a0ee2d0f8272355f750c5434db85291a', 'key: fe216ba17e5af807ce5af8e43cf3c031']
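If you would rather keep working with the list of lines from the question, a possible variant is to keep the compiled patterns and take the matched text from each match object. This is only a sketch: lis here is a shortened stand-in for the question's list, and it assumes each line holds at most one key and one KID, since search() stops at the first match in a line.

```python
import re

# Shortened stand-in for the question's list of lines.
lis = [
    "garbage",
    "Some short sentence",
    "Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)",
    "Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)",
    "blah blah:0101bfa0ab9641a0b863ef76519a48d3 has status 'usable'",
]

matchkid = re.compile(r'KID=\w{30,}')
matchkey = re.compile(r'key: \w{30,}')

# search() returns a match object or None; group(0) is the matched text,
# so the lists hold the substrings rather than the whole lines.
filteredkids = [m.group(0) for m in map(matchkid.search, lis) if m]
filteredkeys = [m.group(0) for m in map(matchkey.search, lis) if m]

print(filteredkids)
print(filteredkeys)
```

If a line can contain several matches, pattern.findall(line) or pattern.finditer(line) per line would be the safer choice.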
Wiktor Stribiżew answered Feb 13 '26 23:02


