I am trying to count the lines in a JSON file. Click HERE to access my JSON file.
I tried to use the code below to count the lines.
    input = open("json/world_bank.json")

    i=0
    for l in input:
        i+=1
    print(i)
But the above code throws a UnicodeDecodeError, as shown below.
    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-17-edc88ade7225> in <module>()
          2 
          3 i=0
    ----> 4 for l in input:
          5     i+=1
          6 

    C:\Users\Subbi Reddy\AppData\Local\Continuum\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
         21 class IncrementalDecoder(codecs.IncrementalDecoder):
         22     def decode(self, input, final=False):
    ---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
         24 
         25 class StreamWriter(Codec,codecs.StreamWriter):

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3979: character maps to <undefined>
Then I included the encoding parameter in the open() call, as shown below.
input = open("json/world_bank.json",encoding="utf8")
Then it started working and gave 500 as the output.
As far as I know, Python's open() should use "utf8" as the default encoding.
Where am I going wrong here?
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
In Python 2, setdefaultencoding() is purposely removed from sys when Python starts. Re-enabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).
String encoding: since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So, each string is just a sequence of Unicode code points. To store or transmit these strings, the sequence of code points is converted into a sequence of bytes.
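To make the code point versus bytes distinction concrete, here is a small sketch. The sample string is an arbitrary choice for illustration and does not come from the question's file.

```python
# Two characters chosen only for illustration: U+00E9 and U+20AC.
text = "\u00e9\u20ac"

# A str is a sequence of Unicode code points.
code_points = [ord(ch) for ch in text]

# The same code points become different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

print(code_points)
print(utf8_bytes)
print(cp1252_bytes)
```

The same two characters take five bytes in UTF-8 and two bytes in cp1252, which is why a file has to be decoded with the encoding it was written in.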
The bytes.decode() method converts a bytes object to a str, and str.encode() does the reverse. Both allow us to specify the error handling scheme to use for encoding/decoding errors. The default is 'strict', meaning that a failure raises a UnicodeError subclass: UnicodeEncodeError when encoding, UnicodeDecodeError when decoding.
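As a rough illustration of the error handling schemes, the sketch below decodes a made-up byte string containing 0x81, the same byte reported in the traceback, which cp1252 leaves undefined.

```python
# Made-up sample bytes; 0x81 has no mapping in cp1252.
raw = b"caf\x81"

# The default scheme, 'strict', raises on the undefined byte.
strict_error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    strict_error = exc

# 'replace' substitutes U+FFFD; 'ignore' silently drops the byte.
replaced = raw.decode("cp1252", errors="replace")
ignored = raw.decode("cp1252", errors="ignore")

print(strict_error)
print(repr(replaced), repr(ignored))
```

Note that 'replace' and 'ignore' only hide the problem. If the file is really UTF-8, decoding it as UTF-8 is the actual fix.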
The default UTF-8 encoding of Python 3 only extends to byte->str conversions. open() instead uses your environment to choose an appropriate encoding:
From the Python 3 docs for open():
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
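You can check what your own environment picks with something like the sketch below. The temporary file name is hypothetical, and the values printed depend on your platform, locale, and Python version, so no particular output is assumed.

```python
import locale
import os
import tempfile

# The encoding the docs quoted above refer to.
preferred = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("x\n")

    # No encoding argument: open() chooses one from the environment.
    with open(path) as f:
        default_used = f.encoding

    # Explicit encoding: the environment is not consulted.
    with open(path, encoding="utf-8") as f:
        explicit_used = f.encoding

print(preferred, default_used, explicit_used)
```

On a Windows machine like the one in the traceback, the first two values would typically be cp1252, while the third is always utf-8.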
In your case, as you're on Windows with a Western Europe/North America locale, you will be given the 8-bit Windows-1252 character set (cp1252). Setting encoding to utf-8 overrides this.
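Putting it together, a minimal sketch of the line count with an explicit encoding might look like this. The count_lines helper and the temporary sample file are hypothetical stand-ins for world_bank.json; the sample contains a character whose UTF-8 encoding includes the byte 0x81.

```python
import os
import tempfile


def count_lines(path, encoding="utf-8"):
    """Count lines in a text file, decoding with an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")  # hypothetical stand-in file
    with open(path, "w", encoding="utf-8") as f:
        # U+00C1 is encoded in UTF-8 as the bytes C3 81.
        f.write('{"name": "\u00c1frica"}\n{"name": "Asia"}\n')

    utf8_count = count_lines(path)

    # Decoding the same bytes as cp1252 reproduces the original error.
    try:
        count_lines(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(utf8_count, cp1252_failed)
```

Using a with block also closes the file, which the original loop did not do.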