Splitting a string based on a pattern in Python

Tags: regex

I have long strings such as

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

I want to split them based on the pattern "a number, a space, a dash, a space, some string until the next number, a space, a dash, a space or end of string". Notice that the string may contain commas, ampersands, '>' and other special characters, so splitting on them will not work. I think there is a way in Python to split based on regular expressions but I have trouble forming that.

I have a very introductory knowledge of regular expressions. I can form a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".

Update: Expected output:

first case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]


asked Jul 26 '18 08:07

Tapal Goosal

2 Answers

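Here is the pattern: each item starts with a number, so we use [0-9]+. It is followed by letters, spaces, and special characters such as & - > and the comma, so we can use [a-zA-Z \-&>,]+. Note that each match keeps the trailing comma and space that separate it from the next item: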

>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

The same pattern applied to the second string from the question:

>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ', 
 '4321 - Apparel & Accessories > Handbags, Wallets & Cases, ', 
 '187 - Apparel & Accessories > Shoes']
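
Each match above still ends with the ", " separator, which differs from the expected output in the question. A minimal sketch of one way to trim it, assuming items are always separated by a comma and spaces and that no label is meant to end with a comma (the helper name split_items is hypothetical, not part of the answer):

```python
import re

def split_items(text):
    """Find each 'number - label' chunk, then trim the trailing separator."""
    chunks = re.findall(r'[0-9]+[a-zA-Z \-&>,]+', text)
    # rstrip(', ') removes any trailing commas and spaces from each chunk.
    return [chunk.rstrip(', ') for chunk in chunks]

print(split_items("123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"))
```

Like the pattern it wraps, this only recognizes the characters listed in the character class, so a label containing anything else (a slash or an apostrophe, say) would be cut short.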

answered Oct 03 '22 09:10

akash karothiya

If numbers appear only at the beginning of each segment, you can do:

import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))

This outputs:

['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']

This pattern uses \d+ to match the leading number and \D+ to match the non-digit characters that follow. The lookahead (?=,\s*\d|$) makes the \D+ match stop where it is followed by either a comma, optional whitespace, and another digit, or the end of the string, so the resulting match does not include the trailing comma and space. Because \D+ cannot match digits, this relies on the labels themselves containing no digits.
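
Since the question asks about splitting, the same lookahead idea can also be used with re.split. The following is only a sketch, assuming every item begins with digits followed by " - " exactly as in the sample strings; the function name split_on_numbered_items is hypothetical:

```python
import re

def split_on_numbered_items(text):
    """Split at each comma (plus spaces) that precedes 'digits - '."""
    # The lookahead (?=\d+ - ) is zero-width, so the number that starts
    # the next item is kept; only the comma and spaces are consumed.
    return re.split(r',\s*(?=\d+ - )', text)

for s in (
    "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products",
    "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes",
):
    print(split_on_numbered_items(s))
```

Because the split point requires the full "number, space, dash, space" prefix, commas inside a label are left alone, and the approach does not depend on which characters the labels contain.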

answered Oct 03 '22 08:10

blhsing
