I have a raw text file containing only the following line, and no newline:
Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439
The characters are escaped as shown above, meaning that the \u05E9 is really a backslash followed by five alphanumeric characters (and not a Unicode character). I am trying to decode the file using the following code:
import codecs
with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(line)
Using print is not possible here; see the comments.
Running it gives me the following error:
Traceback (most recent call last):
  File "terms2.py", line 5, in <module>
    for line in input:
  File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape
What is going on?
I am running Python 3.5.1 on Windows 8.1, and the code seems to work for most other Unicode characters (this line is the first one to cause the crash).
See edit history for the original question.
It seems that the data read by the decoder is truncated after character 72 (0-based index 71). That appears to be related to this bug.
The following code produces the same error as in your example:
import codecs
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline()
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(72)
Increasing the readline size above the actual size of the input, or setting it to -1, eliminates the error:
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(1000)
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(-1)
Evidently, for line in input: obtains the line to be decoded with readline(), effectively truncating the data to be decoded to 72 characters, which can cut a \uXXXX escape in half.
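To illustrate why a cut in the middle of an escape matters, here is a minimal sketch. It uses a short made-up byte string (an assumption for illustration, not the file from the question) and slices it by hand instead of going through readline():

```python
import codecs

# Hypothetical sample data: ASCII bytes containing literal backslash escapes.
data = rb"ab \u0410\u043D"

# Decoding the complete byte string works.
full = codecs.decode(data, "unicode_escape")

# Slicing after 7 bytes leaves b"ab \\u04", i.e. an incomplete \uXXXX escape.
try:
    codecs.decode(data[:7], "unicode_escape")
    truncated_failed = False
except UnicodeDecodeError as exc:
    truncated_failed = True
    print(exc)
```

The same kind of failure occurs whenever the chunk handed to the decoder happens to end inside an escape sequence.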
So here are a couple of workarounds:
Workaround 1:
import codecs
# Read complete lines as plain text, then decode the escapes in each line.
with open("wikidata-terms20.nt", 'r') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(codecs.decode(line, 'unicode_escape'))
Workaround 2:
import codecs
# readlines() reads the whole file before splitting it into lines.
with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input.readlines():
            output.write(line)
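A variation on the first workaround, as a rough sketch: read the input in binary mode so each iteration yields a complete line, decode it, and write the result explicitly as UTF-8. The helper name decode_escaped_file and the choice of UTF-8 for the output are assumptions for illustration, not part of the original code. It also assumes the input is pure ASCII, as in the question, because unicode_escape interprets any non-ASCII bytes as Latin-1.

```python
import codecs


def decode_escaped_file(src_path, dst_path):
    """Decode \\uXXXX escapes line by line and write the result as UTF-8."""
    # Binary iteration splits only on newline bytes, so an escape is never cut in half.
    # newline="" keeps the original line endings unchanged in the output.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw_line in src:
            dst.write(codecs.decode(raw_line, "unicode_escape"))
```

With the file names from the question, this would be called as decode_escaped_file("wikidata-terms20.nt", "wikidata-terms3.nt"). Setting the output encoding explicitly avoids depending on the platform's default encoding when writing non-ASCII characters.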