 

Python regex, remove all punctuation except hyphen for unicode string

I have this code for removing all punctuation from a unicode string using regex:

    import regex as re
    re.sub(ur"\p{P}+", "", txt)

How would I change it to allow hyphens? If you could explain how you did it, that would be great. My understanding (correct me if I'm wrong) is that P followed by anything is a punctuation category.

John asked Jan 18 '14 19:01




2 Answers

[^\P{P}-]+ 

\P is the complement of \p, so \P{P} matches anything that is not punctuation. The class [^\P{P}-] therefore matches anything that is not (non-punctuation or a hyphen), which leaves all punctuation except the hyphen.

Example: http://www.rubular.com/r/JsdNM3nFJ3

If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then use a negative lookbehind to check that the character just matched was not a hyphen.
Working example: http://www.rubular.com/r/5G62iSYTdk
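
For readers who cannot install the third-party regex module, the double negation above can be spelled out with the standard library. This is only a sketch of the same set logic, not the method from the answer: it assumes "punctuation" means any Unicode general category starting with P (which is what \p{P} covers), and the helper name strip_punct_keep_hyphen is made up for illustration.

```python
import unicodedata


def strip_punct_keep_hyphen(text):
    """Drop every Unicode punctuation character except the ASCII hyphen."""
    return "".join(
        ch for ch in text
        # category() returns codes such as "Po", "Pd", "Ps"; every P* code is punctuation
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )


sample = "¿Qué tal? A well-known «quote»…"
cleaned = strip_punct_keep_hyphen(sample)
print(cleaned)
```

Note that characters such as $ or + are Unicode symbols (category S*), not punctuation, so neither this sketch nor \p{P} removes them.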

Kobi answered Sep 17 '22 13:09



Here's how to do it with the re module, in case you have to stick with the standard libraries:

    # works in Python 2 and 3
    import re
    import string

    remove = string.punctuation.replace("-", "")  # don't remove hyphens
    # escape the characters so that ], \ and ^ stay literal inside the character class
    pattern = r"[{}]".format(re.escape(remove))

    txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
    re.sub(pattern, "", txt)
    # >>> 'this - is - a - test'

If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
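
A minimal Python 3 sketch of the str.translate approach, reusing the sample string from the regex version. The three-argument form of str.maketrans is swapped in here as an alternative to the dict comprehension; it should build the same kind of deletion table. The variable names table and cleaned are only for illustration.

```python
import string

# ASCII punctuation with the hyphen taken out, so hyphens survive
remove = string.punctuation.replace("-", "")

# third argument of maketrans lists the characters to delete
table = str.maketrans("", "", remove)

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
cleaned = txt.translate(table)
print(cleaned)
```

Like string.punctuation itself, this only covers ASCII punctuation, so it is not a drop-in replacement for \p{P} on arbitrary unicode text.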

Galen Long answered Sep 18 '22 13:09
