As stated in the title, I would like to check whether a given file object (opened as a binary stream) is a valid UTF-8 file.
Anyone?
Thanks
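Could be simpler using only one line: codecs.open("path/to/file", encoding="utf-8", errors="strict").read(). Note that opening the file decodes nothing by itself; a UnicodeDecodeError is only raised once the contents are actually read.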
To verify whether a file is valid in a given encoding such as ASCII, ISO-8859-1, or UTF-8, a good solution is to use the 'iconv' command.
There are a few options you can use: check the Content-Type header to see if it includes a charset parameter, which would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16); check if the uploaded data has a BOM (the first few bytes in the file, which would map to the Unicode character U+FEFF - 2 bytes for ...
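The BOM check mentioned above could be sketched roughly as follows. The function name sniff_bom and the returned codec labels are illustrative choices, not part of the original answer, and a missing BOM says nothing about the encoding (most UTF-8 files have none), so treat the result only as a hint.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
# The labels are Python codec names that consume the BOM when decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name hinted at by a leading BOM, or None (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first four bytes of the upload are needed for this check, so it can run before the rest of the data is read.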
Use the write() and writelines() methods to write to a text file. Pass encoding='utf-8' to the open() function to write UTF-8 characters into a file.
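As a small illustration of the above, assuming Python 3: the helper name write_utf8 and its parameters are made up for this sketch.

```python
def write_utf8(path, first_line, more_lines):
    """Write text to path as UTF-8 (hypothetical helper for illustration)."""
    # encoding='utf-8' makes the result independent of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(first_line)        # a single string
        f.writelines(more_lines)   # an iterable of strings; no newlines are added
```

For example, write_utf8("out.txt", "héllo\n", ["naïve café\n", "日本語\n"]) writes three lines. Reading the file back in binary mode and calling .decode('utf-8') on the bytes should then succeed.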
def try_utf8(data):
    "Returns a Unicode object on success, or None on failure"
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

data = f.read()  # f is the file object opened in binary mode
udata = try_utf8(data)
if udata is None:
    # Not UTF-8. Do something else
    pass
else:
    # Handle unicode data
    pass
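If the file is too large to read in one go, a variant that validates the binary stream chunk by chunk might look like the sketch below. It swaps the single decode() call for the standard library's incremental decoder; the name is_valid_utf8 and the chunk_size default are assumptions made for this example.

```python
import codecs


def is_valid_utf8(stream, chunk_size=64 * 1024):
    """Return True if the binary stream decodes as UTF-8 (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                # final=True makes a truncated multi-byte sequence
                # at the end of the stream count as an error.
                decoder.decode(b"", final=True)
                return True
            # The decoder buffers a sequence that is split across two chunks.
            decoder.decode(chunk)
    except UnicodeDecodeError:
        return False
```

This consumes the stream, so call f.seek(0) afterwards if the data has to be read again and the file object is seekable.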
You could do something like
import codecs
try:
    f = codecs.open(filename, encoding='utf-8', errors='strict')
    for line in f:  # decoding happens here, while reading
        pass
    print("Valid utf-8")  # print() with one argument works on Python 2 and 3
except UnicodeDecodeError:
    print("invalid utf-8")