
trouble with utf-8 chars & apache2 rewrite rules

I saw the post "validating utf-8 in htaccess rewrite rule" and I think that is great, but I am having a more fundamental problem first.

I needed to expand things to handle utf-8 chars in query string parameters, names of directories and files, text displayed to users, etc.

I configured my Apache with DefaultCharset utf-8 and also my php, if that matters. My original rewrite rule filtered out everything except regular A-Za-z, underscore and hyphen, and it worked: anything else would give you a 404 (which is what I want!). Now, however, it seems that everything matches, including stuff I don't want. And although it seems to match, the value doesn't go into the query string unless it is a regular A-Za-z_- character string.

I find this confusing, because the rule says to put whatever you matched into the query string.

Here is the original rule:

RewriteRule ^/puzzle/([A-Za-z_-]+)$ /puzzle.php?g=$1 [NC]

and here is the revised rule:

RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]

I made the change because somewhere I read that \w matches ALL the alpha chars, whereas A-Z etc. only matches the ones without accents and stuff.

It doesn't seem to matter which of those rules I use. Here is what happens:

In the application I have this:

echo $_GET['g'];

If I feed it a url like http://mydomain.com/puzzle/USA it echoes out "USA" and works fine.
If I feed it a url like http://mydomain.com/puzzle/México it echoes nothing, warns me that index g is not defined, and of course doesn't get the resources for Mexico.
If I feed it a url like http://mydomain.com/puzzle/fuzzle/buzzle/j.qle it does the same thing.
This last case should be a 404!

And it does this no matter which of the above rules I use. I configured a rewrite log

   RewriteLogLevel 5
   RewriteLog /opt/local/apache2/logs/puzzles.httpd.rewrite

but it is empty.

Here is from the regular access log (it gives a status of 200)

[26/May/2010:11:21:42 -0700] "GET /puzzle/M%C3%A9xico HTTP/1.1" 200 342
[26/May/2010:11:21:54 -0700] "GET /puzzle/M/l.foo HTTP/1.1" 200 342

What can I do to get these $%#$@(*#@!!! characters but not slash, dot or other non-alpha into my program, and once there, will it decode them correctly??? Would posix char classes work any better? Is there anything else I need to configure?

Colleen Kitchen asked May 26 '10 19:05



1 Answer

I'd suggest you activate MultiViews and forget mod_rewrite. Add this to your Apache configuration in the relevant Directory/VirtualHost section:

Options +MultiViews
#should already be set to this, but it doesn't hurt:
AcceptPathInfo Default

Now you can always omit the extension, as long as the client includes the corresponding MIME type in its Accept header.

Now a request for /puzzle/whatever will map to /puzzle.php and $_SERVER['PATH_INFO'] will be filled with /whatever.


If you want to do it with mod_rewrite it's also possible. The test string for RewriteRule is unescaped (the %xx portions are converted to the actual bytes they represent). You can get the original escaped string using %{REQUEST_URI} or %{THE_REQUEST} (the last one also contains the HTTP method and version).

By convention, web browsers use UTF-8 encoding in URLs. This means that "México" will be urlencoded to M%C3%A9xico (as your access log shows), not M%E9xico, which would be expected if the browsers used ISO-8859-1. Also, [a-zA-Z] will not match é. However, this should work:

RewriteCond %{REQUEST_URI} ^/puzzle/[^/]*$
RewriteRule ^/puzzle/(.*)$ /puzzle.php?g=$1 [B,L]

You need B to escape the backreference because you're using it in a query string, in which the set of characters that are allowed is smaller than for the rest of the URI.
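To make the escaping steps concrete, here is a small Python sketch (Python is used only for illustration; it is not what Apache runs internally). It shows how "México" is percent-encoded under UTF-8 versus ISO-8859-1, what bytes remain after the %xx portions are unescaped, and what re-escaping those bytes for a query string looks like, which is roughly the effect of B on the backreference.

```python
from urllib.parse import quote, unquote_to_bytes

name = "México"

# Percent-encoding of the UTF-8 bytes (what browsers send by convention).
utf8_escaped = quote(name, encoding="utf-8")      # M%C3%A9xico

# Percent-encoding of the ISO-8859-1 bytes, for comparison.
latin1_escaped = quote(name, encoding="latin-1")  # M%E9xico

# The bytes a byte-oriented matcher sees once the %xx portions are unescaped.
raw_bytes = unquote_to_bytes(utf8_escaped)        # b'M\xc3\xa9xico'

# Re-escaping the captured bytes before they go into a query string.
re_escaped = quote(raw_bytes)                     # M%C3%A9xico

print(utf8_escaped, latin1_escaped, raw_bytes, re_escaped)
```

The UTF-8 form matches the access log line in the question.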

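The thing you should be aware of is that RewriteRule is not Unicode aware: it matches bytes, not characters. Anything other than .* can give (potentially) incorrect results. Even [^/] may not be safe in general, because in some encodings the / "character" (read: byte) could be part of a multi-byte character sequence (in valid UTF-8 it cannot, since bytes below 0x80 never occur inside a multi-byte sequence). If RewriteRule were Unicode aware, your solution with \w should work.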
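The difference between byte-oriented and Unicode-aware matching can be sketched with Python's re module. This is only an analogy: Apache uses a different regex library, and the variable names here are made up for the example. A bytes pattern treats \w as ASCII-only, so the two bytes that encode é do not match, while the same pattern on decoded text does.

```python
import re

path_text = "México"
path_bytes = path_text.encode("utf-8")  # b'M\xc3\xa9xico'

# Byte-oriented matching: \w only covers ASCII letters, digits and underscore.
bytes_match = re.fullmatch(rb"\w+", path_bytes) is not None  # False

# Unicode-aware matching on decoded text: \w covers accented letters too.
text_match = re.fullmatch(r"\w+", path_text) is not None     # True

# A byte-oriented .* accepts any bytes (except a newline), so it still matches.
any_match = re.fullmatch(rb".*", path_bytes) is not None     # True

print(bytes_match, text_match, any_match)
```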

Since you do not want to match subdirectories, and RewriteRule ^/puzzle/[^/]* is not an option, that check is deferred to a RewriteCond that uses the (escaped) %{REQUEST_URI}.
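Because .* in the rule above still lets characters such as dots through, one option is to check the value again inside the application. The following is a minimal sketch of such a check, written in Python for illustration only (the application in the question is PHP, and is_valid_puzzle_name is a hypothetical helper, not part of any library). It assumes the value arrives percent-encoded and should contain only letters, underscores and hyphens.

```python
from urllib.parse import unquote


def is_valid_puzzle_name(raw_value: str) -> bool:
    """Hypothetical helper: accept only letters, underscore and hyphen."""
    try:
        # Decode %xx sequences as UTF-8; reject byte sequences that are not valid UTF-8.
        decoded = unquote(raw_value, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    if not decoded:
        return False
    # str.isalpha() is Unicode-aware, so accented letters such as é pass.
    return all(ch.isalpha() or ch in "_-" for ch in decoded)


for candidate in ["USA", "M%C3%A9xico", "j.qle", "M/l.foo", "M%E9xico"]:
    print(candidate, is_valid_puzzle_name(candidate))
```

A request that fails this check could then be answered with a 404 by the application itself.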

Artefacto answered Oct 19 '22 23:10