
What is the encoding format of nginx's access log?

Tags:

nginx

What is the encoding of Nginx's access.log? I'm trying to iterate through the file but my script raises an exception "invalid byte sequence in UTF-8" when it sees requests to my server that have Chinese/Thai characters in them.

Tom asked Dec 12 '25 21:12



1 Answer

An HTTP request is mostly ASCII, but its data fields are allowed to contain arbitrary octets. See "Which encoding is used by the HTTP protocol?".

Nginx's error and access logs reflect that, so they are not guaranteed to be valid UTF-8.

In my experience with 450,000 records from a small North American site, all but one record decoded as ASCII without error. That record contained 4 consecutive bytes (b'\xb8E\x8c\xde') that were invalid UTF-8 but valid big5hkscs (one of Python's codecs for Traditional Chinese), where they decoded to two glyphs.
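A check like this can be sketched as follows. This is only an illustration, not the script I actually ran: the function name `find_non_ascii` and the sample records are made up. It works on bytes, so no decoding error can occur, and it reports the records that contain bytes outside the ASCII range (`bytes.isascii()` needs Python 3.7 or later).

```python
def find_non_ascii(records):
    """Return (index, record) pairs for records with bytes outside ASCII."""
    return [(i, rec) for i, rec in enumerate(records) if not rec.isascii()]

# Hypothetical sample records; a real log would be read with open(path, "rb").
sample = [
    b'203.0.113.7 - - "GET /index.html HTTP/1.1" 200 512',
    b'203.0.113.9 - - "GET /\xb8E\x8c\xde HTTP/1.1" 404 0',
]

suspects = find_non_ascii(sample)
for index, record in suspects:
    print(index, record)
```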

See "How can I programmatically find the list of codecs known to Python?" for a list of codec names to use in a brute-force attack on the non-ASCII bytes.
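A rough sketch of such a brute-force pass, assuming the codec names are taken from the standard library's `encodings.aliases` table (one of several ways to enumerate them). The function name `candidate_codecs` is hypothetical.

```python
from encodings.aliases import aliases

def candidate_codecs(data: bytes) -> dict:
    """Map each codec name that decodes `data` without error to the decoded text."""
    results = {}
    for name in sorted(set(aliases.values())):
        try:
            results[name] = data.decode(name)
        except (LookupError, ValueError):
            # LookupError: the codec is unavailable on this platform or is not
            # a text encoding. ValueError covers UnicodeDecodeError.
            continue
    return results

matches = candidate_codecs(b'\xb8E\x8c\xde')
for name, text in matches.items():
    print(f"{name}: {text!r}")
```

Expect many hits: single-byte codecs such as latin_1 accept any byte sequence, so a successful decode only makes a codec a candidate. You still have to judge which result is plausible text.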

Decoding the binary log records to ASCII, with '?' replacements for errors, was good enough for my needs.
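A minimal sketch of that approach; the helper name `to_ascii` and the sample request are made up for illustration. Note that when decoding, Python's `errors="replace"` handler inserts U+FFFD rather than '?', so the sketch swaps it for '?' to keep the output pure ASCII.

```python
def to_ascii(record: bytes) -> str:
    """Decode a raw log record to ASCII, marking each undecodable byte with '?'."""
    # The decoder emits U+FFFD for every byte >= 0x80; replace it with '?'.
    return record.decode("ascii", errors="replace").replace("\ufffd", "?")

line = to_ascii(b'GET /\xb8E\x8c\xde HTTP/1.1')
print(line)  # GET /?E?? HTTP/1.1
```

For a whole file, reading it with `open(path, "rb")` and passing each line through the helper avoids the decoding exception entirely.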

jkellydresser answered Dec 16 '25 21:12



