
How to detect string byte encoding?

I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to detect the source encoding so that each name decodes correctly?

Example:

    for item in os.listdir(rootPath):
        # Convert to Unicode
        if isinstance(item, str):
            item = item.decode('cp1252')  # or item = item.decode('utf-8')
        print item
Philipp asked Apr 10 '13 06:04



1 Answer

Use the chardet library. It guesses the encoding from the bytes themselves, and it is super easy:

    import chardet

    the_encoding = chardet.detect('your string')['encoding']

and that's it!

In Python 3 you need to pass bytes or a bytearray, so:

    import chardet

    the_encoding = chardet.detect(b'your string')['encoding']
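For the specific case in the question, where every name is either UTF-8 or CP1252, a different technique that needs no detection library may be enough: try a strict UTF-8 decode first and fall back to CP1252 when it fails. The following is only a sketch for Python 3 under that two-encoding assumption. The helper names decode_filename and decode_listing are made up for illustration and are not part of chardet or the standard library.

```python
import os


def decode_filename(raw):
    """Decode a bytes filename, returning (text, encoding_used)."""
    try:
        # Strict UTF-8 rejects most byte sequences that are not UTF-8.
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is CP1252.
        # CP1252 leaves a few byte values undefined, so replace those
        # instead of raising a second error.
        return raw.decode('cp1252', errors='replace'), 'cp1252'


def decode_listing(root_path):
    """Return the decoded name of every entry in root_path."""
    # Passing a bytes path makes os.listdir() return bytes names.
    raw_names = os.listdir(os.fsencode(root_path))
    return [decode_filename(name)[0] for name in raw_names]
```

Calling decode_listing('.') would return the names in the current directory as str. Pure ASCII names decode identically under both encodings, so they are reported as UTF-8. A CP1252 name whose bytes happen to form valid UTF-8 would be misread, although that is unlikely for ordinary accented text.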
george answered Sep 18 '22 15:09