
Does Nginx support raw unicode in paths?

Tags: url, nginx, unicode

Browsers percent-encode Unicode characters (as %XX sequences) by default.

However, I can make a request via curl to http://localhost:8080/与 and nginx sees the path as "/与". How is this possible? Does nginx allow arbitrary Unicode in its path then?

For example, with this config I can set an additional header to see what nginx saw:

location ~* "(*UTF8)([^\w/\.\-\\% ])" {
        add_header "response" $1;
        return 200;
}

Request:

* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:44:51 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: 与                                        <--- SEE THIS?
< 
* Connection #0 to host localhost left intact

However, when I remove the UTF8 marker, the header contains "?", as if nginx can't understand the character (or is only reading the first byte).

location ~* "([^\w/\.\-\\% ])" {
        add_header "response" $1;
        return 200;
}

Request:

* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:45:35 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: ?
< 
* Connection #0 to host localhost left intact

Note: Changing this non-UTF-8 regex to capture one or more characters ([^...]+) also results in the response: 与 header being sent (byte vs. multibyte strings?)

Logging either regex match to a file results in a request entry like:

GET /\xE4\xB8\x8E HTTP/1.1
asked Jan 20 '15 21:01 by Xeoncross


2 Answers

Apart from the regexes and terminal configuration, this doesn't have anything to do with Unicode. The short answer to your question is: nginx doesn't care about Unicode encodings but it does accept non-ASCII bytes in URLs.

Here's the long answer that explains what you're seeing. If you enter the command

curl http://localhost:8080/与

and your terminal uses UTF-8 as encoding, it will encode the character 与 (U+4E0E) into the three-byte UTF-8 sequence

0xE4 0xB8 0x8E
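
As a quick check of the encoding step, here is a minimal Python sketch (Python is used only for illustration; it is not part of the original setup). It shows the three UTF-8 bytes for 与 and the percent-encoded form a browser would normally send instead.

```python
from urllib.parse import quote

char = "与"  # U+4E0E

# What a UTF-8 terminal hands to curl: three raw bytes.
raw = char.encode("utf-8")
print(raw)

# What a browser would normally put on the wire instead:
# each byte written as %XX.
percent_encoded = quote(char)
print(percent_encoded)
```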

curl apparently accepts non-ASCII bytes in URLs, although they're technically illegal there (RFC 3986 requires them to be percent-encoded). It will then send an HTTP request with these non-ASCII bytes. Since there is no default way to display these bytes, I'll use C-style hex escapes like \x00 from now on to represent them. So the request line sent by curl looks like:

GET /\xE4\xB8\x8E HTTP/1.1

That's three bytes after the first /. If the terminal on which you view your logs also supports UTF-8, this will be displayed on your screen as

GET /与 HTTP/1.1

But this does not mean that there are Unicode characters in your HTTP request. On the HTTP level, we only deal with bytes.

nginx also seems to happily accept non-ASCII bytes in URLs. Then the following regex

(*UTF8)([^\w/\.\-\\% ])

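working in UTF-8 mode treats the byte sequence \xE4\xB8\x8E as the single character 与. That character is not in the excluded set (unless Unicode property support is enabled, PCRE's \w only matches ASCII word characters), so the negated class matches it and the capture holds all three bytes. The header will therefore be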

response: \xE4\xB8\x8E

which your terminal displays as

response: 与

On the other hand, the regex

([^\w/\.\-\\% ])

works directly on bytes, so the negated class matches exactly one byte of your path, or nothing at all. The first byte after the slash, \xE4, is neither an ASCII word character nor any of the other excluded characters, so the capture group contains just that one byte and the header will be:

response: \xE4

which your terminal decides to display as

response: ?

because the byte \xE4 followed by a newline is invalid UTF-8. The regex ([^\w/\.\-\\% ]+), with the quantifier inside the capture group, matches the whole byte sequence, so it produces the same result as the UTF-8 regex.
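
The byte-versus-character difference can be reproduced outside nginx. The sketch below uses Python's re module, which is not PCRE, so treat it as an analogy and not as nginx's actual behaviour. A bytes pattern stands in for the regex without (*UTF8), and a str pattern with re.ASCII stands in for UTF-8 mode with an ASCII-only \w. The helper shown_by_terminal is hypothetical and only imitates a UTF-8 terminal that replaces undecodable bytes with a placeholder.

```python
import re

path_chars = "/与"
path_bytes = path_chars.encode("utf-8")  # b"/\xe4\xb8\x8e"

# Byte mode: the negated class consumes exactly one byte.
BYTE_ONE = re.compile(rb"([^\w/\.\-\\% ])")
# Byte mode with the quantifier inside the group: consumes all three bytes.
BYTE_MANY = re.compile(rb"([^\w/\.\-\\% ]+)")
# Character mode with ASCII-only \w (assumed analogue of (*UTF8) without UCP).
CHAR_ONE = re.compile(r"([^\w/\.\-\\% ])", re.ASCII)

one_byte = BYTE_ONE.search(path_bytes).group(1)
all_bytes = BYTE_MANY.search(path_bytes).group(1)
one_char = CHAR_ONE.search(path_chars).group(1)


def shown_by_terminal(header_value: bytes) -> str:
    """Decode a header line as a UTF-8 terminal might, replacing bad bytes."""
    return (header_value + b"\n").decode("utf-8", errors="replace")


print(one_byte)                              # a lone lead byte
print(repr(shown_by_terminal(one_byte)))     # placeholder instead of 与
print(repr(shown_by_terminal(all_bytes)))    # complete sequence decodes fine
print(one_char)
```

Whether a real terminal prints "?" or some other placeholder for the lone \xE4 depends on the terminal; the point is only that one byte of a three-byte sequence cannot be decoded on its own.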

If you see something like

GET /\xE4\xB8\x8E HTTP/1.1

in your logs, that's because the authors of the logging code decided to use escape sequences for non-ASCII bytes. In general, this is a good idea because it always produces the same output regardless of terminal configuration and really shows what's going on: your HTTP request simply contains non-ASCII bytes.
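
To illustrate that kind of escaping, here is a small Python sketch. The function name escape_for_log is made up for this example, and the rule it applies (escape the double quote, the backslash, and any byte below 32 or above 126 as \xXX) is meant to mirror nginx's documented default log escaping, not to reproduce its implementation.

```python
def escape_for_log(data: bytes) -> str:
    """Render bytes as printable ASCII, writing everything else as \\xXX."""
    out = []
    for b in data:
        # Printable ASCII passes through, except '"' (0x22) and '\' (0x5C).
        if 0x20 <= b <= 0x7E and b not in (0x22, 0x5C):
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02X}")
    return "".join(out)


request_line = b"GET /\xe4\xb8\x8e HTTP/1.1"
print(escape_for_log(request_line))
```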

answered Nov 15 '22 20:11 by nwellnhof


Doesn't your own testing already seem to answer your question?

Yes, nginx does support Unicode in paths.

As a point of discussion, nginx normalises URLs prior to location matching, as described in the documentation at http://nginx.org/r/location. This is why different "weird" requests (such as those containing ../, or those encoding ? as %3F so that it becomes part of the filename instead of marking the start of the parameters known as $args) may still end up being served by a single location that does not look like a one-to-one match to the naked eye.

This normalisation may also explain why the "same" string appears differently within access_log (pre-normalised) vs. error_log (normalised).
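
A rough Python approximation of that normalisation is sketched below. The steps follow what the location documentation describes (decoding %XX, resolving "." and ".." components, merging adjacent slashes), but the function normalize_path is hypothetical and simplified. For instance, it drops a trailing slash, which this sketch does not try to preserve.

```python
from urllib.parse import unquote_to_bytes


def normalize_path(raw_path: str) -> bytes:
    """Simplified sketch of URI normalisation before location matching."""
    # Step 1: decode %XX escapes into raw bytes.
    decoded = unquote_to_bytes(raw_path)

    # Steps 2 and 3: resolve "." and "..", and merge repeated slashes.
    segments = []
    for segment in decoded.split(b"/"):
        if segment in (b"", b"."):
            continue
        if segment == b"..":
            if segments:
                segments.pop()
            continue
        segments.append(segment)
    return b"/" + b"/".join(segments)


# The raw and percent-encoded forms end up as the same bytes.
print(normalize_path("/与") == normalize_path("/%E4%B8%8E"))
# An encoded "?" becomes part of the path and does not start the query string.
print(normalize_path("/file%3Fname"))
```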

answered Nov 15 '22 22:11 by cnst