I have a webpage that accepts CSV files. These files may be created in a variety of places. As far as I know, there is no way to specify the encoding inside a CSV file, so I cannot reliably treat all of them as UTF-8 or any other encoding.
Is there a way to intelligently guess the encoding of the CSV I am getting? I am working with Python, but I am willing to use language-agnostic methods too.
For HTML documents, one way to check the encoding is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and from information in the document itself. If the validator fails to detect the encoding, you can select it on the validator result page via the 'Encoding' pull-down menu.
The non-breaking space character is the single byte 0xA0 in ISO-8859-1. Encoded as UTF-8 it becomes the two bytes 0xC2 0xA0, which, if you (incorrectly) read them as ISO-8859-1, come out as "Â " (an Â followed by a non-breaking space).
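The byte values above can be checked directly in Python. This is only a small illustration; the variable names are made up for the example.

```python
# Non-breaking space, U+00A0
nbsp = "\u00a0"

# One byte in ISO-8859-1, two bytes in UTF-8
latin1_bytes = nbsp.encode("iso-8859-1")
utf8_bytes = nbsp.encode("utf-8")

# Reading the UTF-8 bytes with the wrong codec yields two characters:
# U+00C2 (Â) followed by U+00A0 (the non-breaking space itself)
misread = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes.hex(), utf8_bytes.hex(), repr(misread))
```

A stray "Â" in front of spaces or symbols is therefore a strong hint that UTF-8 data was decoded as ISO-8859-1 (or the closely related Windows-1252).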
Knowing the correct encoding lets the computer map the bytes in a file to the right characters. With the wrong encoding, the bytes are misinterpreted and the text is displayed as garbage.
There is no fully reliable way to determine the encoding of a file by looking only at the file itself, but you can use a heuristics-based solution, e.g. chardet.
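To give an idea of what such a heuristic looks like, here is a minimal standard-library sketch. It is not how chardet works internally (chardet uses statistical models); it only checks for a byte-order mark, then tries strict UTF-8, then falls back to a default. The function name `guess_encoding` and the `cp1252` fallback are assumptions chosen for the example.

```python
import codecs

# Byte-order marks, with UTF-32 first so that its little-endian BOM
# (FF FE 00 00) is not mistaken for the UTF-16 one (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a plausible codec name for data (hypothetical helper)."""
    # 1. A byte-order mark is the closest thing to an explicit declaration.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Non-ASCII text in a legacy 8-bit encoding is rarely valid UTF-8,
    #    so a clean strict decode is good evidence for UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Otherwise this is only a guess; the fallback is an assumption.
        return fallback
```

You would then decode the upload with `data.decode(guess_encoding(data))`. Note that `cp1252` leaves a few byte values undefined, so that decode can still fail; `latin-1` never fails but may produce the wrong characters. If installing a third-party package is an option, `chardet.detect(data)` returns a dict with `encoding` and `confidence` keys and covers many more encodings than this sketch.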