
How to find accented characters in a string in Python?

I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.

I've tried my best to accomplish this, but have had no luck getting it right. Below is one of the solutions I tried, but it clearly gave the wrong answer.

sentence = '¿Qué tipo es el?'  # a str, as read from the file with the standard open() call
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False

I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work.

I'm out of ideas here, any suggestions?

@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then using Python unicode literals for the characters (u'é'). This required me to declare the source file encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings, then compare them. Thank you all for the prompt and great support.
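A minimal sketch of the approach described above, assuming Python 3 (where str is already Unicode, so no u'' prefix or encoding declaration is needed). The function name has_spanish_marker and the SPANISH_MARKERS set are hypothetical, chosen only for illustration; both the sentence and the markers are normalized to the same form before comparing.

```python
import unicodedata

# Hypothetical marker set for illustration; extend it as needed.
SPANISH_MARKERS = "áéíóúüñ¿¡"


def has_spanish_marker(sentence, markers=SPANISH_MARKERS):
    """Return True if the sentence contains any of the marker characters."""
    # Normalize both sides to the composed form (NFC) so that 'e' followed by
    # a combining accent compares equal to the single character 'é'.
    normalized = unicodedata.normalize("NFC", sentence).lower()
    marker_set = set(unicodedata.normalize("NFC", markers))
    return any(ch in marker_set for ch in normalized)


# Simulate bytes read from a latin-1 encoded file, then decode them.
raw = "¿Qué tipo es el?".encode("latin-1")
sentence = raw.decode("latin-1")
print(has_spanish_marker(sentence))
print(has_spanish_marker("What kind is it?"))
```

When reading from disk in Python 3, passing encoding="latin-1" to open() would take the place of the explicit decode step shown here.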

user1411331 asked Nov 10 '12 20:11



1 Answer

Use unicodedata.normalize on the string before checking.

Explanation

Unicode offers more than one way to represent some characters. For example, á can be represented as a single character, á, or as two characters: a, followed by 'put a ´ on top of that' (a combining accent). Normalizing the string forces it into one representation or the other; which one depends on what you pass as the form parameter (for example, 'NFC' for the composed form or 'NFD' for the decomposed form).
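To illustrate the two representations, here is a small sketch, assuming Python 3. The variable names are only for illustration; the escapes spell out the code points so the difference is visible in the source.

```python
import unicodedata

composed = "\u00e1"     # 'á' as a single code point
decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT

# The two strings render the same but are not equal as written.
print(composed == decomposed)

# NFC composes into the single-character form; NFD decomposes into two.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)
print(nfd == decomposed)
print(len(nfc), len(nfd))
```

Either form works for comparison, as long as both strings are normalized to the same one.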

icktoofay answered Sep 29 '22 01:09