
Remove punctuation from Unicode formatted strings

Tags: python, unicode

I have a function that removes punctuation from a list of strings:

import re

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

I recently modified my script to use Unicode strings so I could handle other non-Western characters. This function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?
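To make the failure concrete, here is a minimal Python 3 sketch of the behaviour described above. The helper name `strip_ascii_only` and the sample words are illustrative assumptions, not part of the original script; only the character class is taken from it.

```python
import re

def strip_ascii_only(words):
    # Same character class as in the question: anything that is not
    # A-Z, a-z, 0-9 or a space is deleted, including non-Latin letters.
    return [re.sub(r"[^A-Za-z0-9 ]", "", word) for word in words]

# Illustrative input: one ASCII word, one Cyrillic word, one Japanese word.
print(strip_ascii_only(["hello,", "Привет!", "日本語。"]))
# → ['hello', '', '']
```

The class is a whitelist of ASCII characters, so every Cyrillic or Japanese character is treated the same way as punctuation and removed, which leaves empty strings.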

acpigeon asked Jun 16 '12 19:06




1 Answer

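You could use the unicode.translate() method:

import sys
import unicodedata

# Python 2 code (xrange, unichr); on Python 3 use range and chr instead.
# Map every code point whose Unicode category starts with 'P' (punctuation)
# to None, so that translate() deletes it.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)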
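For readers on Python 3, the same idea can be sketched as below. This is a port under the assumption that `str.translate()` is used in place of `unicode.translate()`; the names `PUNCT_TABLE` and `strip_punctuation` are illustrative, the latter mirroring the list-based function from the question.

```python
import sys
import unicodedata

# Build the deletion table once: every code point in a punctuation
# category (Pc, Pd, Ps, Pe, Pi, Pf, Po) maps to None.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def remove_punctuation(text):
    # str.translate() drops every character whose code point maps to None.
    return text.translate(PUNCT_TABLE)

def strip_punctuation(words):
    # Return a new list instead of mutating the caller's list in place.
    return [remove_punctuation(word) for word in words]
```

Note that only the punctuation categories are removed. Characters in the symbol categories, such as "$" (Sc) or "+" (Sm), are left untouched, so the table would need to include categories starting with 'S' if those should go as well.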

You could also use r'\p{P}', which is supported by the third-party regex module:

import regex as re

def remove_punctuation(text):
    # The ur"" prefix is Python 2 only; on Python 3 write r"\p{P}+".
    return re.sub(ur"\p{P}+", "", text)
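The standard library `re` module does not understand `\p{P}`, so the pattern above needs the third-party `regex` package. Where installing it is not an option, one possible workaround (a different technique from the answer, sketched here for Python 3) is to build an explicit character class from `unicodedata`. The names `_PUNCT_CHARS`, `_PUNCT_RUN` and `remove_punctuation_re` are illustrative.

```python
import re
import sys
import unicodedata

# Collect every character whose general category starts with "P".
_PUNCT_CHARS = "".join(
    chr(i)
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

# re.escape() protects characters such as "]", "^", "-" and "\" so they
# are taken literally inside the character class.
_PUNCT_RUN = re.compile("[" + re.escape(_PUNCT_CHARS) + "]+")

def remove_punctuation_re(text):
    # Delete each run of punctuation characters.
    return _PUNCT_RUN.sub("", text)
```

The pattern is compiled once at import time, so repeated calls only pay for the substitution itself.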
jfs answered Oct 09 '22 21:10