This question is linked to Searching for Unicode characters in Python
I read a Unicode text file using Python's codecs module:
codecs.open('story.txt', 'rb', 'utf-8-sig')
I was then trying to search for strings in it, but I'm getting the following warning:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Is there a special way to compare Unicode strings?
A UnicodeEncodeError normally happens when encoding a unicode string into a particular encoding. Since most encodings map only a limited number of Unicode characters to str strings, a character the encoding cannot represent will cause that encoding's encode() to fail.
A UnicodeDecodeError normally happens when decoding a str string from a particular encoding. Since an encoding maps only a limited number of str strings to Unicode characters, an illegal sequence of str characters will cause that encoding's decode() to fail.
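As a rough illustration of the two failure modes described above, here is a minimal sketch. The sample values (`text`, `raw`) are made up for the example, and the code is written so that it should run on both Python 2.7 and Python 3, where `bytes` plays the role of Python 2's `str`.

```python
# A text (unicode) string containing U+00E9, which ASCII cannot represent.
text = u'caf\xe9'

# Encoding fails when the target encoding has no byte sequence for a character.
try:
    text.encode('ascii')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Decoding fails when the bytes are not a legal sequence in the source encoding.
# 0x81 cannot start a UTF-8 sequence, so these bytes are rejected.
raw = b'\x81\x01'
try:
    raw.decode('utf-8')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# An encoding that covers the character round-trips without any error.
round_trip = text.encode('utf-8').decode('utf-8')
```

Both exceptions are subclasses of ValueError, so the errors can be caught together if the distinction does not matter.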
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
You may use the == operator to compare unicode objects for equality.
>>> s1 = u'Hello'
>>> s2 = unicode("Hello")
>>> type(s1), type(s2)
(<type 'unicode'>, <type 'unicode'>)
>>> s1==s2
True
>>>
>>> s3='Hello'.decode('utf-8')
>>> type(s3)
<type 'unicode'>
>>> s1==s3
True
But your error message indicates that you aren't comparing unicode objects. You are probably comparing a unicode object to a str object, like so:
>>> u'Hello' == 'Hello'
True
>>> u'Hello' == '\x81\x01'
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
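See how I have attempted to compare a unicode object against a str that cannot be converted to Unicode. Python 2 tries to decode the str with the default encoding (normally ASCII), and '\x81\x01' is neither valid ASCII nor valid UTF-8.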
Your program, I suppose, is comparing unicode objects with str objects, and the contents of one of the str objects cannot be decoded implicitly (for example, because it holds non-ASCII UTF-8 bytes). This is most likely the result of you (the programmer) not knowing which variable holds unicode, which variable holds UTF-8, and which variable holds the bytes read in from a file.
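One way to avoid the mixed comparison is to decode any byte string explicitly before comparing. The sketch below is only an illustration: `to_text` is a hypothetical helper (not part of any library), and it assumes the bytes are UTF-8, which you would need to confirm for your own data.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):   # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value

line = u'na\xefve caf\xe9'             # text, e.g. a line read through codecs.open
needle = u'caf\xe9'.encode('utf-8')    # bytes, e.g. from a binary file or a socket

# Both sides are text by the time they are compared, so no implicit
# ASCII decoding (and no UnicodeWarning) is involved.
found = to_text(needle) in line
same = to_text(needle) == u'caf\xe9'
```

If the bytes are not actually UTF-8, `to_text` raises UnicodeDecodeError at the point of decoding, which is usually easier to diagnose than a silent "unequal" result.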
I recommend http://nedbatchelder.com/text/unipain.html, especially the advice to create a "Unicode Sandwich."
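A minimal sketch of that idea, under some assumptions: the sample file is created on the fly as a stand-in for story.txt, the search term is invented, and `io.open` is used in place of the question's `codecs.open` (both decode while reading; `io.open` behaves the same on Python 2.7 and Python 3).

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample file standing in for story.txt: UTF-8 with a BOM.
path = os.path.join(tempfile.mkdtemp(), 'story.txt')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + u'Il \xe9tait une fois\nThe end\n'.encode('utf-8'))

# Outer layer: decode bytes to text at the input boundary.
# 'utf-8-sig' also strips the leading byte order mark.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    lines = f.read().splitlines()

# Middle layer: work only with text; the search term is text as well.
term = u'\xe9tait'
matches = [line for line in lines if term in line]

# Outer layer: encode back to bytes only at the output boundary.
output = u'\n'.join(matches).encode('utf-8')
```

Because every value in the middle layer is text, the search never compares a unicode object with a str, which is the situation that triggers the warning.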