
Is it possible to "sniff" the character encoding?

I have a webpage that accepts CSV files. These files may be created in a variety of places. As far as I know, there is no way to specify the encoding inside a CSV file, so I cannot reliably treat all of them as UTF-8 or any other single encoding.

Is there a way to intelligently guess the encoding of the CSV I am getting? I am working with Python, but I am willing to use language-agnostic methods too.

asked May 27 '13 10:05 by shabda




1 Answer

There is no fully reliable way to determine the encoding of a file by looking only at the file itself, but you can use a heuristics-based solution, e.g. chardet.
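To illustrate what "heuristics-based" can mean, here is a minimal standard-library sketch. It is not how chardet works internally. It only checks for a byte-order mark, then tries a strict UTF-8 decode, and otherwise falls back to a default. The function name `guess_encoding` and the `cp1252` fallback are assumptions made for this example.

```python
import codecs

# Byte-order marks, checked longest-first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Return a best-guess codec name for raw bytes (hypothetical helper)."""
    # 1. A byte-order mark is the strongest hint available.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Non-ASCII text in a legacy 8-bit encoding is rarely valid UTF-8,
    #    so a successful strict decode is good evidence for UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Assumed default; pick whatever your uploaders most likely use.
        return fallback


if __name__ == "__main__":
    sample = "name,city\nZoë,Kraków\n".encode("utf-8")
    encoding = guess_encoding(sample)
    print(encoding, sample.decode(encoding).splitlines())
```

The result is still a guess: a file in another legacy encoding will be reported as the fallback, and decoding with `cp1252` can itself fail on the few byte values it leaves undefined. With chardet installed, `chardet.detect(raw)` returns a dict that includes `encoding` and `confidence` keys, which covers many more encodings than this sketch.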

answered Oct 03 '22 01:10 by asciimoo