I have long strings such as
"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
and
"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
I want to split them based on the pattern "a number, a space, a dash, a space, some string until the next number, a space, a dash, a space or end of string". Notice that the string may contain commas, ampersands, '>' and other special characters, so splitting on them will not work. I think there is a way in Python to split based on regular expressions, but I have trouble forming the pattern.
I have a very introductory knowledge of regular expressions. I can form a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".
Update: Expected output:
first case:
["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]
second case:
["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]
Regex to split a string with multiple delimiters: The regex split() function gives you more flexibility. You can specify a pattern that covers multiple delimiters, whereas the string's split() method accepts only one fixed separator string.
Introduction to the Python regex split() function: The built-in re module provides the split() function, which splits a string at the matches of a regular expression. It is called as re.split(pattern, string, maxsplit=0, flags=0), where pattern is a regular expression whose matches are used as separators and string is the input string to split.
Splitting strings with the split() method: We can choose what to split a string on by passing a separator to the string's split() method. By default, split() uses whitespace as the separator, but we are free to provide another separator if we want.
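For the strings in the question, re.split() can be applied directly by treating the comma as the delimiter only when it is followed by the start of a new entry. The following is a minimal sketch, not taken from any of the answers; the function name split_categories is made up for illustration, and it assumes every entry starts with "digits, space, dash, space" as described in the question.

```python
import re

def split_categories(text):
    # Split on a comma plus optional whitespace, but only where the comma
    # is immediately followed by "<digits> - ", i.e. the start of a new entry.
    # The lookahead (?=...) checks for that start without consuming it.
    return re.split(r",\s*(?=\d+ - )", text)

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
print(split_categories(s))
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```

Commas inside a name, such as the one in "Apparel, Accessories & Luxury Goods", are left alone because they are not followed by a number and a dash.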
Here is the pattern: first there is a number, so we use [0-9]+, followed by letters and special characters such as &, -, > and the comma, so we can use [a-zA-Z \-&>,]+ for the rest:
>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
'5678 - Apparel, Accessories & Luxury Goods, ',
'9876 - Leisure Products']
The other string mentioned in the question:
>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ',
'4321 - Apparel & Accessories > Handbags, Wallets & Cases, ',
'187 - Apparel & Accessories > Shoes']
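Note that every match except the last keeps its trailing ", ", which differs from the expected output in the question. One way to tidy this up is to strip those characters after matching. This is a small sketch under that assumption; the helper name find_entries is hypothetical, and the (?is) flags are dropped because the pattern contains neither a dot nor case-sensitive literals.

```python
import re

def find_entries(text):
    # Same character-class pattern as above: a number followed by letters,
    # spaces and the listed punctuation. Digits are not in the class, so
    # each match ends right before the next number.
    raw = re.findall(r"[0-9]+[a-zA-Z \-&>,]+", text)
    # Remove any trailing commas and spaces left at the end of a match.
    return [m.rstrip(", ") for m in raw]

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
print(find_entries(s))
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```

The character class still has to list every symbol that can occur in a name, so a name containing, say, a slash or an apostrophe would be cut short.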
If numbers appear only at the beginning of each segment, you can do:
import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    # A number, then non-digits, stopping before ", <digit>" or the end of the string.
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))
This outputs:
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
This regex pattern uses \d+ to match the number first, then \D+ to match the non-digits that follow, and the lookahead (?=,\s*\d|$) to make sure the run of non-digits stops where it is followed either by a comma, optional whitespace and another digit, or by the end of the string, so that the resulting match does not include a trailing comma and space.
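If a name can itself contain digits, \D+ no longer covers the whole name. A possible variant, sketched here as an assumption rather than as part of the answer above, anchors on the full "number, space, dash, space" prefix from the question and matches the name lazily. The function name split_on_prefix and the second sample string are made up for illustration.

```python
import re

# "<digits> - " followed by the shortest text that runs up to either
# ", <digits> - " (the start of the next entry) or the end of the string.
ENTRY = re.compile(r"\d+ - .+?(?=,\s*\d+ - |$)")

def split_on_prefix(text):
    return ENTRY.findall(text)

print(split_on_prefix("321 - Apparel & Accessories, 187 - Apparel & Accessories > Shoes"))
# ['321 - Apparel & Accessories', '187 - Apparel & Accessories > Shoes']

# Hypothetical input with digits inside the names.
print(split_on_prefix("12 - Top 40 Hits, 34 - 3D Printers"))
# ['12 - Top 40 Hits', '34 - 3D Printers']
```

This still assumes that a name never contains a comma followed by a number, a space and a dash, and that the input is a single line, since . does not match a newline by default.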