I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252. I want to decode all of them to Unicode for further processing in my script. Is there a way to detect the source encoding so that each name can be decoded correctly?
Example:
import os

for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item
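One common pattern for exactly this mixed-encoding situation (a sketch under the question's Python 2 assumptions, not necessarily what the asker settled on) is to try the stricter UTF-8 decode first and fall back to CP1252 when it raises, since CP1252 text containing non-ASCII bytes is rarely valid UTF-8:

import os

for item in os.listdir(rootPath):  # rootPath as in the question
    if isinstance(item, str):
        try:
            item = item.decode('utf-8')    # strict: raises on most CP1252 bytes
        except UnicodeDecodeError:
            item = item.decode('cp1252')   # fallback
    print item

CP1252 assigns a character to almost every byte value, so the fallback nearly always succeeds; trying UTF-8 first is what makes the guess reliable.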
To detect the encoding of strings in R, you can use the detect_str_enc() function. It is vectorized and accepts a character vector; missing values are skipped. Strings in R can only be in one of three encodings: UTF-8, Latin-1, and native.
There are a few options you can use: check the Content-Type header to see if it includes a charset parameter, which would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16); or check whether the uploaded data has a BOM (the first few bytes of the file, which map to the Unicode character U+FEFF: 2 bytes for UTF-16, 3 bytes for UTF-8).
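A minimal sketch of the BOM check in Python (sniff_bom is a hypothetical helper; the BOM constants are part of the standard codecs module):

import codecs

def sniff_bom(data):
    # Compare the leading bytes against the well-known byte-order marks.
    for bom, enc in [(codecs.BOM_UTF8, 'utf-8-sig'),
                     (codecs.BOM_UTF16_BE, 'utf-16-be'),
                     (codecs.BOM_UTF16_LE, 'utf-16-le')]:
        if data.startswith(bom):
            return enc
    return None  # no BOM; fall back to headers or statistical detection

print(sniff_bom(b'\xef\xbb\xbfhello'))  # utf-8-sig

Decoding with 'utf-8-sig' strips the BOM automatically, which is usually what you want.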
You can use type() or isinstance(). In Python 2, str is just a sequence of bytes; Python doesn't know what its encoding is. The unicode type is the safer way to store text.
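For illustration, a small Python 2 snippet showing the distinction (the byte string is assumed to be UTF-8 encoded):

name = 'caf\xc3\xa9'             # str: raw bytes
print type(name)                 # <type 'str'>
if isinstance(name, str):
    name = name.decode('utf-8')  # bytes -> unicode
print type(name)                 # <type 'unicode'>

Note that isinstance() only distinguishes bytes from text; it cannot tell you which encoding the bytes are in.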
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.
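A quick way to see those widths (a Python 3 snippet added for illustration):

for ch in ['A', 'é', '€', '😀']:
    # ASCII, Latin-1 range, BMP, and supplementary-plane characters
    print(ch, len(ch.encode('utf-8')))  # 1, 2, 3 and 4 bytes respectively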
Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

and that's it!
In Python 3 you need to pass bytes or a bytearray, so:

import chardet

the_encoding = chardet.detect(b'your string')['encoding']
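Tying this back to the original question, a sketch that guesses a codec per filename (assumptions: Python 3 with chardet installed; the cp1252 fallback when detection returns None is a choice made here, not part of chardet):

import os
import chardet

for raw in os.listdir(b'.'):  # passing bytes makes listdir return undecoded bytes
    guess = chardet.detect(raw)['encoding'] or 'cp1252'
    print(raw.decode(guess, errors='replace'))

Keep in mind that chardet is more reliable on longer inputs, so very short filenames may be misdetected.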