There are a few threads on Stack Overflow, but I couldn't find a valid solution to the problem as a whole.
I have collected a large amount of textual data from the urllib read function and stored it in pickle files.
Now I want to write this data to a file. While writing, I'm getting errors similar to:
'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)
and a lot of data is being lost.
I suppose the data returned by the urllib read is byte data.
I've tried:
1. text=text.decode('ascii','ignore')
2. s=filter(lambda x: x in string.printable, s)
3. text=u''+text
text=text.decode().encode('utf-8')
but I'm still ending up with similar errors. Can somebody point out a proper solution? Also, would codecs strip work? I don't mind if the conflicting bytes are not written to the file, so that loss is acceptable.
You can do it with smart_str from Django's django.utils.encoding module. Just try this:
from django.utils.encoding import smart_str, smart_unicode
text = u'\u2019'
print smart_str(text)
You can install Django by starting a command shell with administrator privileges and running this command:
pip install Django
Your data is unicode data. To write that to a file, use .encode():
text = text.encode('ascii', 'ignore')
but that would remove anything that isn't ASCII. Perhaps you wanted to encode to a more suitable encoding, like UTF-8, instead?
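If UTF-8 is acceptable, a minimal sketch of the whole round trip might look like the following. It assumes the downloaded pages are UTF-8 (the question does not say which encoding they use), and the helper names to_unicode, to_ascii_lossy and write_text are made up for illustration. io.open is used because it takes an encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def to_unicode(data, encoding='utf-8'):
    """Decode raw bytes (e.g. from a urllib read) to unicode.

    Assumption: the source is UTF-8. Undecodable bytes are replaced
    with U+FFFD instead of raising an error.
    """
    if isinstance(data, bytes):
        return data.decode(encoding, 'replace')
    return data  # already unicode


def to_ascii_lossy(text):
    """Encode to ASCII, silently dropping every non-ASCII character."""
    return text.encode('ascii', 'ignore')


def write_text(path, text, encoding='utf-8'):
    """Write a unicode string to path, encoding it on the way out."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)


# The character from the error message, U+2019, as raw UTF-8 bytes.
raw = b'It\xe2\x80\x99s here'
text = to_unicode(raw)

# Lossless: write as UTF-8 and read it back.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
write_text(path, text)
with io.open(path, encoding='utf-8') as f:
    round_trip = f.read()

# Lossy: the apostrophe is dropped entirely.
ascii_only = to_ascii_lossy(text)
```

Here round_trip still contains the U+2019 apostrophe, while ascii_only has lost it. The key point is to decode bytes once on the way in and encode once on the way out, rather than calling .decode() on something that is already unicode.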
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder