I am splitting a text on non-alphanumeric characters and would like to exclude specific characters, such as apostrophes, dashes/hyphens and commas, so that they are not treated as delimiters.
I would like to build a regex for the following cases:
This is what I have tried:
import re

def split_text(text):
    # Splits on every non-word character, including apostrophes, hyphens and commas.
    my_text = re.split(r'\W', text)
    # The following don't work:
    # my_text = re.split(r'([A-Z]\w*)', text)
    # my_text = re.split(r"^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$", text)
    return my_text
Any ideas?
Is this what you want?
Split on non-alphanumeric characters, excluding apostrophes and hyphens:
my_text = re.split(r"[^\w'-]+",text)
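Split on non-alphanumeric characters, excluding commas, apostrophes and hyphens (the hyphen goes last in the class so it is read literally; writing \w-' makes Python raise a "bad character range" error):
my_text = re.split(r"[^\w',-]+",text)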
The [] syntax defines a character class; [^...] "complements" it, i.e. it matches any character that is not listed in the class.
See the documentation about that:
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
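To make the negated class concrete, here is a minimal sketch. The sample string and the names `pattern`, `sample` and `tokens` are only for illustration and are not from the question.

```python
import re

# Negated class: any run of characters that is NOT a word character,
# an apostrophe, or a hyphen is treated as a delimiter.
# The hyphen is placed last so it is read as a literal, not as a range.
pattern = re.compile(r"[^\w'-]+")

sample = "It's a well-known fact, isn't it"  # illustrative input
tokens = pattern.split(sample)
print(tokens)  # ["It's", 'a', 'well-known', 'fact', "isn't", 'it']
```

Note that the comma after "fact" is dropped here because it is not listed inside the class.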
You can use a negated character class for this:
my_text = re.split(r"[^\w'-]+",text)
or
my_text = re.split(r"[^\w,'-]+",text) # commas are kept as well, i.e. not treated as delimiters
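One thing to watch for: `re.split` returns empty strings when the text starts or ends with a delimiter. A possible way to wrap both patterns and drop those empties is sketched below. It reuses the `split_text` name from the question, but the `keep_commas` parameter and the two constant names are assumptions made for this example.

```python
import re

# Hypothetical constants for the two patterns shown above.
WORDS = re.compile(r"[^\w'-]+")
WORDS_AND_COMMAS = re.compile(r"[^\w,'-]+")

def split_text(text, keep_commas=False):
    """Split on non-word characters, keeping apostrophes and hyphens.

    keep_commas is an illustrative flag: when True, commas also stay
    attached to the tokens instead of acting as delimiters.
    """
    pattern = WORDS_AND_COMMAS if keep_commas else WORDS
    # Leading or trailing delimiters produce empty strings; filter them out.
    return [tok for tok in pattern.split(text) if tok]

print(split_text("Hello, world! Don't stop-now."))
print(split_text("Hello, world! Don't stop-now.", keep_commas=True))
```

The first call should give the words without the comma, while the second keeps "Hello," with its comma attached.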