Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting a string based on a pattern in Python

Tags:

python

regex

I have long strings such as

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

I want to split them based on the pattern "a number, a space, a dash, a space, some string until the next number, a space, a dash, a space or end of string". Notice that the string may contain commas, ampersands, '>' and other special characters, so splitting on them will not work. I think there is a way in Python to split based on regular expressions but I have trouble forming that.

I have a very introductory knowledge of regular expressions. I can form a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".


Update: Expected output:

first case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]

like image 208
Tapal Goosal Avatar asked Jul 26 '18 08:07

Tapal Goosal


People also ask

Can I use regex with split Python?

Regex to Split string with multiple delimitersWith the regex split() method, you will get more flexibility. You can specify a pattern for the delimiters where you can specify multiple delimiters, while with the string's split() method, you could have used only a fixed character or set of characters to split a string.

How do you split a string by the occurrences of a regex pattern Python?

Introduction to the Python regex split() function The built-in re module provides you with the split() function that splits a string by the matches of a regular expression. In this syntax: pattern is a regular expression whose matches will be used as separators for splitting. string is an input string to split.

How do you split a string by a specific character in Python?

Splitting Strings with the split() Function We can specify the character to split a string by using the separator in the split() function. By default, split() will use whitespace as the separator, but we are free to provide other characters if we want.


2 Answers

Here is the pattern, first there is some number so we use [0-9]+ followed by string and special characters like & - >, therefore we can use [a-zA-Z \-&>]+:

>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

Another string you mentioned in OP

>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ', 
 '4321 - Apparel & Accessories > Handbags, Wallets & Cases, ', 
 '187 - Apparel & Accessories > Shoes']
like image 95
akash karothiya Avatar answered Oct 03 '22 09:10

akash karothiya


If numbers appear only at the beginning of each segment of strings, you can do:

import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))

This outputs:

['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']

This regex pattern uses \d+ to match numbers first, then uses \D+ to match non-numbers, and uses lookahead pattern (?=,\s*\d|$) to make sure that the non-numbers stops at the point where it's followed by either a comma, some spaces and another number, or the end of the string, so that the resulting match won't include a trailing comma and a space.

like image 27
blhsing Avatar answered Oct 03 '22 08:10

blhsing