
How to extract specific strings using Python Regex

Tags: python, regex

I have some challenging strings that I have been struggling with.
For example:

str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

Each string starts with an integer percentage, optionally followed by "for", and then the name of a Pokemon. Next comes a comma (,) or an & sign, then another integer percentage, and finally another Pokemon name. (All names start with a capital letter.)
I want to extract the two Pokemon names from each string. For example, the expected results are:

['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']

I could create a list of all Pokemon and then use the in operator, but that is not the best way (in case they add more Pokemon). Is it possible to extract the names using regex?
Thanks in advance!
EDIT
As requested, I am adding my code,

str_list = [str1, str2, str3, str4, str5]

for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
print(temp_list)

I know this is not a regex. The only expression I have tried is \d+, to extract the leading integer, but I have no idea how to continue from there.
EDIT2
@b_c raised a good edge case, so I am adding it here:

edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'

Expected result:

['Pikachu', 'Pika Pika Pikachu']
jayko03 asked Feb 15 '26 18:02


2 Answers

Hopefully I didn't over-engineer this, but I wanted to cover the edge cases of the slightly more complicated Pokemon names, such as "Mr. Mime", "Farfetch'd", and "Nidoran♂" (only looking at the first 151).

The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).

For a general summary, I'm looking for:

  • 1+ digits followed by a %
  • A space or the word "for" at least once
  • (To start the capture) A starting capital letter
  • At least one of (ending the capture group):
    • a word character, a period, the male/female symbols, or an apostrophe
      • Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
    • OR a space, but only if followed by a capital letter
  • A comma, space, or ampersand, any number of times
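The breakdown above can be sketched as a commented pattern using re.VERBOSE. This is a slightly simplified variant, not the exact pattern from this answer: it drops the outer repetition and the trailing separator part, and the names POKEMON and extract_names are made up for illustration.

```python
import re

# Simplified, commented variant of the pattern described above.
# POKEMON and extract_names are hypothetical names for this sketch.
POKEMON = re.compile(r"""
    \d+%                    # one or more digits followed by a percent sign
    (?:[ ]|for)+            # a space or the word "for", at least once
    (                       # start of the captured name
        [A-Z]               # a leading capital letter
        (?:
            [\w.♀♂']        # word character, period, gender symbol, or apostrophe
          | [ ](?=[A-Z])    # or a space, but only if a capital letter follows
        )+
    )
""", re.VERBOSE)


def extract_names(text):
    """Return every captured name found in text."""
    return POKEMON.findall(text)


print(extract_names('95% for Pikachu, 92% for Mr. Mime'))
print(extract_names('100% for Pikachu, 29% Pika Pika Pikachu'))
```

In verbose mode, unescaped whitespace in the pattern is ignored, which is why the literal space is written as [ ].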

Unless it's changed, Python's built-in re module doesn't keep every match of a repeated capture group (only the last one is retained), so I just used re.findall and organized the results into pairs (I replaced a couple of names from your input with the complicated ones):

import re

str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'

pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"

# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]

# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])

for pair in pairs:
    print(pair)

This then prints out:

('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')

Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)
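If a string might not contain exactly two names, pairing up a flat list could mix names from different strings. A minimal sketch of an alternative, assuming one result list per input string is acceptable, is to call re.findall per string. The helper name names_per_string and the '50% Abra' sample are made up for illustration.

```python
import re

# Same pattern as in the answer above.
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"


# Hypothetical helper, not part of the original answer.
def names_per_string(strings):
    """Return one list of captured names per input string."""
    return [re.findall(pattern, s) for s in strings]


samples = [
    '95% for Pikachu, 92% for Mr. Mime',
    '100% for Pikachu, 29% Pika Pika Pikachu',  # edge case from the question
    '50% Abra',                                 # made-up single-name input
]

for names in names_per_string(samples):
    print(names)
```

Each input keeps its own list, so a string with one or three names does not shift the grouping of the strings after it.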

b_c answered Feb 17 '26 07:02


Since there seem to be no other upper-case letters in your strings, you can simply use [A-Z]\w+ as the regex. See regex101.

Code:

import re

str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

str_list = [str1, str2, str3, str4, str5]
regex = re.compile(r'[A-Z]\w+')  # raw string so \w is not treated as a string escape
pokemon_list = []
for x in str_list:
    pokemon_list.append(re.findall(regex, x))
print(pokemon_list)

Output:

[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]
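This simpler pattern relies on every name being a single run of word characters. A short check against the multi-word names mentioned elsewhere on this page (the variable name simple is made up for this sketch) shows where that assumption stops holding:

```python
import re

# Same expression as above, compiled from a raw string.
simple = re.compile(r'[A-Z]\w+')

# Names containing spaces or punctuation are split into separate matches.
print(simple.findall('95% for Pikachu, 92% for Mr. Mime'))
print(simple.findall('100% for Pikachu, 29% Pika Pika Pikachu'))
```

Here "Mr. Mime" comes back as 'Mr' and 'Mime', and "Pika Pika Pikachu" as three separate matches, so this approach fits inputs like the original five strings but not the edge case added in EDIT2.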
LeoE answered Feb 17 '26 07:02