I have a text file that looks like this:
garbage
moregarbaged89849843
MDeduri09ri44830
Some short sentence
Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)
Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)
Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)
77EB0A2C7C42EDC27A3D26E72A02BB29:01002d737832455680cffbadf1092baf status 'garbage'
blah blah:0101bfa0ab9641a0b863ef76519a48d3 has status 'usable'
77EB0A2C7C42EDC27A3D26E72A02BB29:blah blah
I only care about the key and KID parts, and want to extract them into separate lists.
My regexes for that are key: (\w|\d){30,} and KID=(\w|\d){30,}, respectively.
The code I'm using is:
matchkid = re.compile(r'KID=(\w|\d){30,}')
matchkey = re.compile(r'key: (\w|\d){30,}')
filteredkids = [a for a in lis if matchkid.search(a)]
filteredkeys = [b for b in lis if matchkey.search(b)]
print(filteredkids)
print('\n')
print(filteredkeys)
where lis is a list made from the lines of the text file.
The output is:
['Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)', 'Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)', 'Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)']
['Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)', 'Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)', 'Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)']
That is wrong; the desired output is:
['KID=01002d737832455680cffbadf1092baf', 'KID=0101bfa0ab9641a0b863ef76519a48d3', 'KID=0102900a2bc54111833631ea7bb855ed']
['key: d11001bfa937eee2f84f55a11b207356', 'key: a0ee2d0f8272355f750c5434db85291a', 'key: fe216ba17e5af807ce5af8e43cf3c031']
I have tried tweaking my regex and looking at other similar questions, but nothing seems to work and most of the time I just get empty lists.
Hoping to find some guidance here, thanks in advance
The (\w|\d){30,} pattern is not a good choice: it creates a repeated capturing group, which only keeps its last repetition, and the alternation is redundant, because \w already matches digits. \w{30,} is enough.
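To illustrate the repeated capturing group problem, here is a minimal sketch using one line from the question (the variable names line, grouped and plain are only for illustration). When a pattern contains a capturing group, re.findall returns the group's content rather than the whole match, and a repeated group only retains its last repetition:

```python
import re

line = "Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)"

# Repeated capturing group: findall returns only the last character the group captured.
grouped = re.findall(r'KID=(\w|\d){30,}', line)

# No capturing group: findall returns the whole match.
plain = re.findall(r'KID=\w{30,}', line)

print(grouped)  # ['f']
print(plain)    # ['KID=01002d737832455680cffbadf1092baf']
```

This is likely why tweaking the original pattern kept producing single characters or otherwise unexpected results.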
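Next, re.search only returns a match object (or None), and your list comprehension uses it merely as a filter: it keeps the whole line whenever a match exists, so you get full lines back instead of the matched text. What you need is to grab the matches themselves from your strings.
You can fix the code by running re.findall on the whole file content (text here):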
filteredkids = re.findall(r'KID=\w{30,}', text)
filteredkeys = re.findall(r'key: \w{30,}', text)
See the Python demo:
import re
text = """garbage
moregarbaged89849843
MDeduri09ri44830
Some short sentence
Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)
Whatever2 ... key: a0ee2d0f8272355f750c5434db85291a (KID=0101bfa0ab9641a0b863ef76519a48d3)
Whatever3 ... key: fe216ba17e5af807ce5af8e43cf3c031 (KID=0102900a2bc54111833631ea7bb855ed)
77EB0A2C7C42EDC27A3D26E72A02BB29:01002d737832455680cffbadf1092baf status 'garbage'
blah blah:0101bfa0ab9641a0b863ef76519a48d3 has status 'usable'
77EB0A2C7C42EDC27A3D26E72A02BB29:blah blah"""
filteredkids = re.findall(r'KID=\w{30,}', text)
filteredkeys = re.findall(r'key: \w{30,}', text)
print( filteredkids )
print( filteredkeys )
Output:
['KID=01002d737832455680cffbadf1092baf', 'KID=0101bfa0ab9641a0b863ef76519a48d3', 'KID=0102900a2bc54111833631ea7bb855ed']
['key: d11001bfa937eee2f84f55a11b207356', 'key: a0ee2d0f8272355f750c5434db85291a', 'key: fe216ba17e5af807ce5af8e43cf3c031']
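If you would rather keep working with the list of lines, a possible sketch is shown below. It assumes lis holds the file's lines as in the question (the shortened list here is just a stand-in) and collects the matched text with finditer and group() instead of the whole line:

```python
import re

# Stand-in for the question's `lis`: a few lines from the sample file.
lis = [
    "garbage",
    "Whatever ... key: d11001bfa937eee2f84f55a11b207356 (KID=01002d737832455680cffbadf1092baf)",
    "blah blah:0101bfa0ab9641a0b863ef76519a48d3 has status 'usable'",
]

matchkid = re.compile(r'KID=\w{30,}')
matchkey = re.compile(r'key: \w{30,}')

# For every line, take the text of each match rather than the line itself.
filteredkids = [m.group() for line in lis for m in matchkid.finditer(line)]
filteredkeys = [m.group() for line in lis for m in matchkey.finditer(line)]

print(filteredkids)
print(filteredkeys)
```

Lines without a match simply contribute nothing, so no separate filtering step is needed.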