In a text file, there is a string "I don't like this".
However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use
f1 = open (file1, "r")
text = f1.read()
command to do the reading.
Now, is it possible to read the string in such a way that when it is read into the string, it is "I don't like this", instead of "I don\xe2\x80\x98t like this like this"?
Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode ( and vice versa) conversion?
Individual characters in a string can be accessed by specifying the string name followed by a number in square brackets ( [] ). String indexing in Python is zero-based: the first character in the string has index 0 , the next has index 1 , and so on.
Which function is used to read all the characters? content = fh. read().
fgetc() function is a file handling function in C programming language which is used to read a character from a file.
We use the getc() and putc() I/O functions to read a character from a file and write a character to a file respectively. Syntax of getc: char ch = getc(fptr); Where, fptr is a file pointer.
Ref: http://docs.python.org/howto/unicode
Reading Unicode from a file is therefore simple:
import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
for line in f:
print repr(line)
It's also possible to open files in update mode, allowing both reading and writing:
with codecs.open('test', encoding='utf-8', mode='w+') as f:
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
EDIT: I'm assuming that your intended goal is just to be able to read the file properly into a string in Python. If you're trying to convert to an ASCII string from Unicode, then there's really no direct way to do so, since the Unicode characters won't necessarily exist in ASCII.
If you're trying to convert to an ASCII string, try one of the following:
Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example
Use the unicodedata
module's normalize()
and the string.encode()
method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):
>>> teststr
u'I don\xe2\x80\x98t like this'
>>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
'I donat like this'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With