What is the encoding of Nginx's access.log? I'm trying to iterate through the file but my script raises an exception "invalid byte sequence in UTF-8" when it sees requests to my server that have Chinese/Thai characters in them.
An HTTP protocol request is mostly ASCII, with data fields allowed to be any octets. See Which encoding is used by the HTTP protocol? .
Nginx's error and access logs will reflect that.
My experience with 450,000 records from a small North American site revealed that all but one record decoded to ASCII without error. That record contained 4 consecutive bytes (b'\xb8E\x8c\xde') that were invalid UTF-8 but were valid big5hkscs (Python's codec for Traditional Chinese) that produced two glyphs.
See How can I programmatically find the list of codecs known to Python? to get a list of codec names for a brute force attack on non-ASCII bits.
Decoding the binary log records to ASCII, with '?' replacements for errors, was good enough for my needs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With