
string.translate() with unicode data in python

I have 3 APIs that return JSON data into 3 dictionary variables. I take some of the values from the dictionaries to process them, reading the specific values I want into the list valuelist. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get the error:

    wordlist = [s.translate(None, string.punctuation) for s in valuelist]
    TypeError: translate() takes exactly one argument (2 given)

Is there a way around this? Either by encoding the unicode or a replacement for string.translate?

adohertyd asked Jul 27 '12 16:07



1 Answer

The translate method works differently on Unicode objects than on byte-string objects:

    >>> help(unicode.translate)
    S.translate(table) -> unicode

    Return a copy of the string S, where all characters have been mapped
    through the given translation table, which must be a mapping of
    Unicode ordinals to Unicode ordinals, Unicode strings or None.
    Unmapped characters are left untouched. Characters mapped to None
    are deleted.

So your example would become:

    import string
    # Map each ASCII punctuation code point to None so translate() deletes it
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    word_list = [s.translate(remove_punctuation_map) for s in value_list]
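For anyone reading this on Python 3, where every str is already Unicode, here is a minimal sketch of the same idea. It assumes Python 3 (str.maketrans does not exist on Python 2's unicode type), and the sample value_list is made up for illustration, standing in for the values read from the API dictionaries:

```python
import string

# With three arguments, str.maketrans maps every character in the
# third argument to None, so translate() deletes those characters.
remove_punctuation_map = str.maketrans("", "", string.punctuation)

# Hypothetical sample data, not from the original question
value_list = ["Hello, world!", "it's 5 o'clock...", "no punctuation"]
word_list = [s.translate(remove_punctuation_map) for s in value_list]
```

The table is built once and reused for every string, which is the same design as the dict-based version above.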

Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
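If the ASCII set turns out to be too narrow, one possible approach is to build the table from Unicode general categories instead. This is only a sketch and an assumption about what "punctuation" should mean for your data: it is written for Python 3, it deletes every code point whose category starts with "P", and the helper name strip_punctuation is made up for illustration.

```python
import sys
import unicodedata

# Map every code point in a Unicode punctuation category (Pc, Pd, Ps, Pe,
# Pi, Pf, Po) to None. Built once up front; scanning the full range takes
# a moment, so reuse the table rather than rebuilding it per call.
unicode_punctuation_map = dict.fromkeys(
    code_point
    for code_point in range(sys.maxunicode + 1)
    if unicodedata.category(chr(code_point)).startswith("P")
)

def strip_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(unicode_punctuation_map)
```

Be aware that this is not a superset of string.punctuation: characters such as $ and + are classified as symbols (categories starting with "S"), so they are left in place unless you add them to the table yourself.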

Simon Sapin answered Sep 28 '22 08:09