I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.
I've tried my best to accomplish this, but have had no luck in getting it right. Below is one of the solutions I tried, but clearly gave the wrong answer.
sentence = ¿Qué tipo es el? #in str format, received from standard open file method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False
I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work.
I'm out of ideas here, any suggestions?
@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then using the Python unicode for the characters (u'é'
). This required me to set the Python unicode encoding at the top of the script. The final step was to use the unicodedata.normalize
method to normalize both strings, then compare accordingly. Thank you guys for the prompt and great support.
Method: To check if a special character is present in a given string or not, firstly group all special characters as one set. Then using for loop and if statements check for special characters. If any special character is found then increment the value of c.
Python String startswith()The startswith() method returns True if a string starts with the specified prefix(string). If not, it returns False .
if symbols == symbols. isalpha(): ... will test if symbols , your input string, is equal to the result of symbols. isalpha() which returns a boolean True or False .
Individual characters in a string can be accessed by specifying the string name followed by a number in square brackets ( [] ). String indexing in Python is zero-based: the first character in the string has index 0 , the next has index 1 , and so on.
Use unicodedata.normalize
on the string before checking.
Unicode offers multiple forms to create some characters. For example, á
could be represented with a single character, á
, or two characters: a
, then 'put a ´
on top of that'. Normalizing the string will force it to one or the other of the representations. (which representation it normalizes to depends on what you pass as the form
parameter)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With