I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, or can't trust, the character encoding that comes declared with the data (if any).
For a while I have used Python's chardet library (http://pypi.python.org/pypi/chardet) to detect the original character encoding, but I ran into problems lately when I noticed that it doesn't support Scandinavian encodings (for example ISO-8859-1). Apart from that, it takes a huge amount of time/CPU/memory to get results: about 40 s for a 2 MB text file.
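For reference, my detection-and-conversion step currently looks roughly like this (a minimal sketch; the file names are just placeholders):

import chardet

# Read the raw bytes and let chardet guess the encoding.
with open("name.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
text = raw.decode(guess["encoding"])

# Re-encode as UTF-8 for storage.
with open("name.utf8.txt", "w", encoding="utf-8") as out:
    out.write(text)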
I tried just using the standard Linux file command:
file -bi name.txt
With all my files so far it gives me a 100% correct result, and it does so in about 0.1 s for a 2 MB file. It supports Scandinavian character encodings as well.
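In case it matters, I call it from my Python code roughly like this (a sketch; it simply shells out to file and parses the charset= part of the output):

import subprocess

def detect_encoding(path):
    # `file -bi` prints something like "text/plain; charset=iso-8859-1"
    out = subprocess.check_output(["file", "-bi", path]).decode("ascii").strip()
    return out.split("charset=")[-1]

enc = detect_encoding("name.txt")
with open("name.txt", encoding=enc) as f:
    text = f.read()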
So I guess the advantages of using file are clear. What are the downsides? Am I missing something?
Old MS-DOS- and Windows-formatted files can be detected as unknown-8bit instead of ISO-8859-X, because their encodings are not completely standard. chardet, by contrast, will make an educated guess and report a confidence value.
http://www.faqs.org/faqs/internationalization/iso-8859-1-charset/
If you won't be handling old, exotic, out-of-standard text files, I think you can use file -i without many problems.
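If you do want to cover those odd cases as well, one simple approach is to use file first and fall back to chardet only when file gives up (a sketch, assuming both tools are installed; the unknown-8bit/binary check is my own heuristic):

import subprocess
import chardet

def detect_encoding(path):
    # Ask `file` first: it is fast and usually right.
    out = subprocess.check_output(["file", "-bi", path]).decode("ascii").strip()
    charset = out.split("charset=")[-1]
    if charset not in ("unknown-8bit", "binary"):
        return charset
    # Fall back to chardet's statistical guess for the rare files `file` can't name.
    with open(path, "rb") as f:
        guess = chardet.detect(f.read())
    return guess["encoding"]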