How to find accented characters in a string in Python?

Tags: python, string, unicode

I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.

I've tried my best to accomplish this, but have had no luck getting it right. Below is one of the solutions I tried, which clearly gave the wrong answer.

sentence = '¿Qué tipo es el?'  # in str format, received from standard open file method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False

I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work either.

I'm out of ideas here, any suggestions?

@icktoofay provided the solution. I kept the decoding of the file (using latin-1), but switched to Python unicode literals for the characters (u'é'). This required declaring the source encoding at the top of the script. The final step was to use unicodedata.normalize on both strings and then compare them. Thank you all for the prompt and helpful support.
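For later readers, the steps described in this update might look roughly like the sketch below. It is written for Python 3 (the code in the question is Python 2), and the helper name read_normalized_lines, the temporary demo file, and the choice of the NFC form are assumptions made for illustration, not details from the original script.

```python
import os
import tempfile
import unicodedata


def read_normalized_lines(path, encoding="latin-1", form="NFC"):
    """Decode a text file and return its lines normalized to one Unicode form.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    with open(path, encoding=encoding) as handle:
        return [unicodedata.normalize(form, line.rstrip("\n")) for line in handle]


# Write a throwaway Latin-1 file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="latin-1", suffix=".txt", delete=False
) as tmp:
    tmp.write("\u00bfQu\u00e9 tipo es el?\nWhat kind is it?\n")
    demo_path = tmp.name

try:
    lines = read_normalized_lines(demo_path)
finally:
    os.remove(demo_path)

# Normalize the search character to the same form as the text.
needle = unicodedata.normalize("NFC", "\u00e9")  # é
flags = [needle in line for line in lines]
print(flags)  # [True, False]
```

In Python 3 every str is already Unicode, so the u'' prefix and the source encoding declaration are no longer needed; the decoding step and the normalization step are the parts that carry over.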

348

asked Nov 10 '12 20:11

user1411331

1 Answer

Use unicodedata.normalize on the string before checking.

Explanation

Unicode allows some characters to be represented in more than one way. For example, á can be a single precomposed character, á, or two characters: a followed by a combining acute accent ('put a ´ on top of that'). The two look identical on screen but do not compare equal. Normalizing the string forces it into one representation or the other; which one depends on what you pass as the form parameter (for example, 'NFC' composes and 'NFD' decomposes).
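A minimal sketch of the difference, assuming Python 3 and the NFC (composed) form. The marker set and the function name has_spanish_markers are illustrative assumptions, not part of the original answer.

```python
import unicodedata

# Characters treated as hints of Spanish text. This set is an assumption
# chosen for illustration; it is normalized so it matches NFC input.
SPANISH_MARKERS = set(unicodedata.normalize("NFC", "áéíóúüñÁÉÍÓÚÜÑ¿¡"))


def has_spanish_markers(sentence):
    """Return True if the NFC-normalized sentence contains a marker character."""
    composed = unicodedata.normalize("NFC", sentence)
    return any(ch in SPANISH_MARKERS for ch in composed)


precomposed = "Qu\u00e9 tal"   # é as a single code point (U+00E9)
decomposed = "Que\u0301 tal"   # e followed by combining acute accent (U+0301)

# The two strings render the same but are different sequences of code points.
print(precomposed == decomposed)                                 # False
print("\u00e9" in decomposed)                                    # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(has_spanish_markers(decomposed))                           # True
print(has_spanish_markers("What kind is it?"))                   # False
```

Normalizing both the text and the characters being searched for to the same form is what makes the membership test reliable; either NFC or NFD would work as long as both sides use the same one.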

136

answered Sep 29 '22 01:09

icktoofay
