Character detection in a text file in Python using the Universal Encoding Detector (chardet)

I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.

While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.

However, I cannot work out how to assign the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).

My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:

    import chardet
    rawdata=open(infile,"r").read()
    chardet.detect(rawdata)

Character detection is necessary as the script goes on to run the following (as well as several similar uses):

    inF=open(infile,"rb")
    s=unicode(inF.read(),charenc)
    inF.close()

Any help would be greatly appreciated.

960

asked Jul 24 '10 04:07

木川炎星

1 Answer

chardet.detect() returns a dictionary which provides the encoding as the value associated with the key 'encoding'. So you can do this:

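    import chardet

    # Read the file as raw bytes; detection runs on bytes, not decoded text
    rawdata = open(infile, 'rb').read()
    # detect() returns a dict; the guessed encoding is under 'encoding'
    result = chardet.detect(rawdata)
    charenc = result['encoding']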

The chardet documentation does not say explicitly whether the module expects text strings or byte strings, but if you already have a text string there is nothing left to detect, so you should pass byte strings. Hence the binary mode flag (b) in the call to open(). Depending on which versions of Python and of the library you are using, chardet.detect() might also accept a text string, i.e. if you omit the b you might find that it works anyway, even though you are technically doing something wrong.
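To tie detection to the decoding step from the question, here is a minimal sketch written for Python 3, where rawdata.decode(charenc) plays the role of unicode(rawdata, charenc). The helper name detect_and_decode, its detect parameter, and the 'utf-8' fallback are assumptions made for illustration and are not part of chardet. The fallback is there because the 'encoding' value can be None when no guess is possible, for example on empty input.

```python
def detect_and_decode(path, detect=None, fallback="utf-8"):
    """Read `path` as bytes, guess its encoding, and return (text, encoding).

    Hypothetical helper: `detect` defaults to chardet.detect, but any
    callable returning a dict with an 'encoding' key can be passed instead.
    """
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect

    # Binary mode, so the detector sees the raw bytes
    with open(path, "rb") as f:
        rawdata = f.read()

    result = detect(rawdata)

    # 'encoding' may be None; fall back to an assumed default in that case
    charenc = result["encoding"] or fallback

    # errors="replace" keeps a wrong guess from raising UnicodeDecodeError
    text = rawdata.decode(charenc, errors="replace")
    return text, charenc
```

Calling text, charenc = detect_and_decode(infile) would then give both the decoded text and the encoding name to reuse elsewhere in the script. Since the result is only a guess, it may be worth checking result['confidence'] before relying on it.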

158

answered Oct 10 '22 19:10

David Z
