
Extracting multiple substrings from one string

I have the following string, which I am parsing from another file: "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)". I want to split it into individual values, which I do with the .split() function, so I get them as a list:

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

Now I want to split each of them further into 3 segments and store those in 3 other variables so I can write them to Excel, like this:

a = CHEM1
b = 5
c = GL

for the first element, then I will loop back for the second element:

a = CH3M2
b = 55
c = LB

and finally :

a = CHEM3954114
b = 50
c = KG

I am unsure how to go about this as I am still new to Python. To the best of my knowledge, I would have to call the split function multiple times, but I believe there has to be a better way to do it than that.

Thank you.

Ahmed asked Jun 09 '26 21:06


1 Answer

You should use the re package:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")

for x1 in x:
    m = pattern.search(x1)
    if m:  # skip items that do not match the pattern
        # a = name, b = quantity as an integer, c = unit
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
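
If you would rather skip the .split() step, a variant of the pattern can be run over the whole string at once. This is only a sketch: the names raw and rows are mine, and the pattern is tightened (no whitespace in the first group, no ) in the last one) on the assumption that every item looks like the three in your example.

```python
import re

# Variant of the pattern above: the name may not contain whitespace or "(",
# and the unit may not contain ")", so each match stays within one item.
pattern = re.compile(r"([^\s(]+)\((\d+)([^)]+)\)")

raw = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# One (name, quantity, unit) tuple per item found in the string
rows = [(m.group(1), int(m.group(2)), m.group(3)) for m in pattern.finditer(raw)]

for a, b, c in rows:
    print(a, b, c)
```

Collecting the tuples in a list first keeps the parsing separate from whatever you use to write the rows out.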

FOLLOW UP:

The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case. Essentially, there are 3 groups of characters you want to extract:

  1. All the characters (letters and numbers) up to the ( - not included
  2. The digits after the (
  3. The letters after the digits extracted in the previous step - up to the ) - not included.

A group is anything enclosed in brackets (). In this specific case it may become confusing because, as stressed above, your string itself contains brackets, which need to be escaped with a \ to distinguish them from the ones that delimit groups in the regular expression.
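
To make the difference concrete, here is a small sketch with one of your items (the variable names are just for illustration). The plain brackets only capture, while the escaped one has to match a real ( in the text:

```python
import re

s = "CHEM1(5GL)"

# Plain brackets only define a group; they match no characters themselves,
# so this captures the first run of digits anywhere in s.
unescaped = re.search(r"(\d+)", s).group(1)

# \( matches a literal "(" in the text, so the captured digits must follow it.
escaped = re.search(r"\((\d+)", s).group(1)

print(unescaped)  # 1  (the digit in "CHEM1")
print(escaped)    # 5  (the digit right after the "(")
```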

  • The first group is ([^\(]+), which essentially means: match one or more characters which are not ( (the ^ is the negation; I escaped the ( here for the reasons described above, although inside square brackets the escape is optional). Note that the characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more precise if you need to (for example, allowing only letters, digits and underscores with \w+)
  • The second group is (\d+), which is essentially matching 1 or more (expressed with +) digits (expressed with \d).
  • The last group is (.+), which matches any remaining characters; the final \) then requires a literal closing bracket, so the group ends just before the last ) in the string.
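
If you want the stricter version hinted at in the first bullet, something along these lines might do. It is a sketch under assumptions, not a drop-in replacement: the group names (name, qty, unit) and the helper parse_item are made up for this example, and it assumes the unit is always plain ASCII letters.

```python
import re

# Stricter, hypothetical variant: word characters for the name,
# digits for the quantity, ASCII letters for the unit.
strict = re.compile(r"(?P<name>\w+)\((?P<qty>\d+)(?P<unit>[A-Za-z]+)\)")

def parse_item(item):
    """Return (name, quantity, unit), or None if item has a different shape."""
    m = strict.fullmatch(item)  # the whole item must match, not just a part
    if m is None:
        return None
    return m.group("name"), int(m.group("qty")), m.group("unit")

print(parse_item("CH3M2(55LB)"))
print(parse_item("CHEM1(GL)"))  # no quantity, so this returns None
```

Using fullmatch instead of search means malformed items are rejected rather than partially matched, which is usually what you want before writing rows to a spreadsheet.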

nikeros answered Jun 12 '26 10:06


