How to print Unicode string in Python?

If you simply want to print a Unicode string readably, encode it with the string's encode method. To make sure that every line read from a file is Unicode, use the codecs.open function instead of plain open; it lets you specify the file's encoding.
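A minimal sketch of both steps, assuming Python 3. The temporary file and its contents are made up for illustration and are not from the original answer. On Python 3 the built-in open also accepts an encoding argument.

```python
import codecs
import os
import tempfile

# Hypothetical sample file, written only so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9\nna\u00efve\n")

# codecs.open decodes each line to a Unicode string using the given encoding.
with codecs.open(path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Encoding explicitly yields the bytes a UTF-8 terminal would receive.
encoded = [line.encode("utf-8") for line in lines]

print(lines)
print(encoded)
```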

How do you deal with BOM characters?

The simplest approach I've found is to deal with BOM characters at the Unicode level and let the codecs do the heavy lifting. There is only one Unicode byte order mark (U+FEFF), so once the data is converted to Unicode characters, determining whether it's there, adding it, or removing it is easy. To read a file with a possible BOM, decode the file first and then check the first character.
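One way this could look, assuming Python 3 and a UTF-8 file. The helper name read_without_bom and the temporary file are hypothetical, introduced only for this sketch.

```python
import codecs
import os
import tempfile

BOM = u"\ufeff"  # the single Unicode byte order mark character


def read_without_bom(path, encoding="utf-8"):
    """Decode the whole file, then drop a leading BOM character if present."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return text[1:] if text.startswith(BOM) else text


# Hypothetical test file that starts with a UTF-8 encoded BOM.
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"hello")

print(repr(read_without_bom(path)))
```

Because the check happens after decoding, the same comparison against U+FEFF works whichever Unicode encoding the file used.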

Is there a way to convert Unicode to ASCII in Python?

EDIT: I'm assuming that your intended goal is just to be able to read the file properly into a string in Python. If you're trying to convert to an ASCII string from Unicode, then there's really no direct way to do so, since the Unicode characters won't necessarily exist in ASCII.
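If a lossy conversion is acceptable, the standard library offers a few approximations. The sketch below, with a made-up sample string, shows three of them; none is a true conversion, since characters without an ASCII counterpart are dropped or replaced.

```python
import unicodedata

text = u"caf\u00e9 \u2603"  # hypothetical sample: 'cafe' with an accent, plus a snowman

# Silently drop every character that has no ASCII equivalent.
dropped = text.encode("ascii", errors="ignore")

# Substitute a question mark for each such character.
replaced = text.encode("ascii", errors="replace")

# Decompose accented letters first so the base letter survives the drop.
stripped = unicodedata.normalize("NFKD", text).encode("ascii", errors="ignore")

print(dropped, replaced, stripped)
```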

Reading Unicode file data with BOM chars in Python

Tags:

python

unicode

There is no reason to check whether a BOM exists: utf-8-sig handles that for you and behaves exactly like utf-8 if the BOM is absent:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that utf-8-sig correctly decodes the given bytes whether or not a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and don't worry about it.
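The same codec can be passed straight to open when reading files. A small sketch, assuming Python 3; the two temporary files are invented for the demonstration.

```python
import codecs
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# Hypothetical inputs: the same text with and without a UTF-8 BOM.
samples = {
    "plain.txt": b"hello",
    "bom.txt": codecs.BOM_UTF8 + b"hello",
}

results = {}
for name, payload in samples.items():
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as f:
        f.write(payload)
    # utf-8-sig skips a leading BOM if there is one and otherwise acts like utf-8.
    with open(path, encoding="utf-8-sig") as f:
        results[name] = f.read()

print(results)
```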

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:

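import codecs
import io
import os

import chardet  # third-party: pip install chardet

# filename is assumed to hold the path of the file to read
num_bytes = min(32, os.path.getsize(filename))
with open(filename, 'rb') as f:
    raw = f.read(num_bytes)

if raw.startswith(codecs.BOM_UTF8):
    # utf-8-sig strips the BOM while decoding
    encoding = 'utf-8-sig'
else:
    # fall back to guessing from the first bytes
    result = chardet.detect(raw)
    encoding = result['encoding']

with io.open(filename, 'r', encoding=encoding) as infile:
    data = infile.read()

print(data)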

I've composed a nifty BOM-based detector based on Chewie's answer. It's sufficient in the common use case where data can be either in a known local encoding or Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:

import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
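To see the detector at work, the sketch below repeats the function (with the loop written using parentheses rather than backslash continuations) and runs it on a few temporary files. The file labels and the 'cp1252' default are assumptions chosen for illustration.

```python
import codecs
import os
import tempfile


def detect_by_bom(path, default):
    with open(path, "rb") as f:
        raw = f.read(4)  # will read less if the file is smaller
    # UTF-32 must be tried before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE.
    for enc, boms in (
        ("utf-8-sig", (codecs.BOM_UTF8,)),
        ("utf-32", (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
        ("utf-16", (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
    ):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default


tmpdir = tempfile.mkdtemp()
detected = {}
decoded = {}
for label, payload in [
    ("utf8", "hi".encode("utf-8-sig")),  # these three codecs write a BOM
    ("utf16", "hi".encode("utf-16")),
    ("utf32", "hi".encode("utf-32")),
    ("plain", b"hi"),                    # no BOM: falls back to the default
]:
    path = os.path.join(tmpdir, label + ".txt")
    with open(path, "wb") as f:
        f.write(payload)
    detected[label] = detect_by_bom(path, default="cp1252")
    with open(path, encoding=detected[label]) as f:
        decoded[label] = f.read()

print(detected)
```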

chardet has detected BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:

#!/usr/bin/env python
import chardet # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32) # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()
print(text)

Note: chardet may return 'UTF-XXLE', 'UTF-XXBE' encodings that leave the BOM in the text. 'LE', 'BE' should be stripped to avoid it -- though it is easier to detect BOM yourself at this point e.g., as in @ivan_pozdeev's answer.
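A rough sketch of stripping the suffix, assuming Python 3. The helper strip_endianness is hypothetical and not part of chardet; it only rewrites the encoding name so that Python's generic utf-16 or utf-32 codec consumes the BOM.

```python
import codecs


def strip_endianness(name):
    """Hypothetical helper: map e.g. 'UTF-16LE' to 'utf-16'."""
    lowered = name.lower().replace("_", "-")
    if lowered.startswith(("utf-16", "utf-32")) and lowered.endswith(("le", "be")):
        return lowered[:6]  # keep just 'utf-16' or 'utf-32'
    return lowered


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# An explicit byte order treats the BOM as ordinary text and leaves it in.
kept = raw.decode("utf-16-le")

# The generic codec reads the BOM to pick the byte order, then discards it.
dropped = raw.decode(strip_endianness("UTF-16LE"))

print(repr(kept), repr(dropped))
```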

To avoid UnicodeEncodeError while printing Unicode text to Windows console, see Python, Unicode, and the Windows console.
