
Normalizing (WebDAV) Unicode paths

I'm working on a WebDAV implementation for PHP. In order to make it easier for Windows and other operating systems to work together, I need to jump through some character encoding hoops.

Windows uses ISO-8859-1 in its HTTP requests, while most other clients encode anything beyond ASCII as UTF-8.

My first approach was to ignore this altogether, but I quickly ran into issues when returning URLs. I then figured it's probably best to normalize all URLs.

Take ü as an example. OS X sends it over the wire as:

u%CC%88 (the letter u followed by codepoint U+0308, the combining diaeresis)

Windows sends this as:

%FC (latin1)

But when I run utf8_encode on %FC, I get:

%C3%BC (this is codepoint U+00FC)

Should I treat %C3%BC and u%CC%88 as the same thing? If so, how? Leaving the path untouched seems to work OK for Windows: it somehow understands that it's a Unicode character, but updating the same file throws an error (for no particular reason).
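For reference, the byte-level difference between the three wire forms can be reproduced with a minimal sketch like the one below. The variable names are made up for illustration, and iconv stands in for utf8_encode (which is deprecated as of PHP 8.2); both produce the same bytes for Latin-1 input.

```php
<?php
// The three wire forms from the question, percent-decoded to raw bytes.
$mac     = rawurldecode('u%CC%88'); // "u" + U+0308 (combining diaeresis), UTF-8
$windows = rawurldecode('%FC');     // a single ISO-8859-1 byte
$nfc     = rawurldecode('%C3%BC');  // U+00FC (precomposed), UTF-8

// Re-encode the Latin-1 byte as UTF-8.
$windowsUtf8 = iconv('ISO-8859-1', 'UTF-8', $windows);

var_dump(bin2hex($mac));         // string(6) "75cc88"
var_dump(bin2hex($windowsUtf8)); // string(4) "c3bc"
var_dump($windowsUtf8 === $nfc); // bool(true)
var_dump($mac === $nfc);         // bool(false): same glyph, different bytes
```

So after the Latin-1 conversion, the Windows form matches the precomposed form byte for byte, while the OS X form only matches after Unicode normalization.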

I'd be happy to provide more information.

Evert asked May 24 '26 00:05


2 Answers

Mac stores Unicode characters in "decomposed" form (NFD), that is, "u" + ¨ (combining diaeresis) instead of the precomposed "ü". PHP's Normalizer class (from the intl extension) can take care of that. If you don't have Normalizer, try iconv('UTF8-MAC', 'UTF8', $str)
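A minimal sketch of that approach, assuming the goal is to bring every incoming path to NFC. The function name normalizePathToNfc is hypothetical, and whether iconv recognises UTF-8-MAC depends on the iconv implementation PHP was built against, so the fallback is treated as best effort:

```php
<?php
/**
 * Hypothetical helper: bring a UTF-8 path to NFC (precomposed) form.
 * Returns the input unchanged if no normalizer is available.
 */
function normalizePathToNfc(string $path): string
{
    // Preferred: the intl extension's Normalizer class.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($path, Normalizer::FORM_C);
        return $normalized === false ? $path : $normalized;
    }

    // Fallback from the answer above. Assumption: not every iconv build
    // knows UTF-8-MAC, so the warning is suppressed and failure is tolerated.
    $converted = @iconv('UTF-8-MAC', 'UTF-8', $path);
    return $converted === false ? $path : $converted;
}

$decomposed = rawurldecode('u%CC%88');      // what OS X sends
$composed   = normalizePathToNfc($decomposed);

// With intl loaded this dumps "c3bc", the same bytes as %C3%BC.
var_dump(bin2hex($composed));
```

Already-composed input passes through unchanged, so the helper can be applied to every path regardless of which client sent it.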

user187291 answered May 26 '26 15:05


I hate answering my own questions, but here goes.

I ended up not bothering. I did extensive research on how various operating systems encode paths and handle encodings. It turns out that in most cases other OSes handle paths in other normalization forms just fine. Windows was a bit flaky, but it works.

Whenever I receive a path that isn't valid UTF-8 at all, I try to detect the encoding and convert it to UTF-8.
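The answer doesn't say how the detection is done, so the following is only a sketch under one loud assumption: anything that is not valid UTF-8 is treated as ISO-8859-1, which is what Windows sent in the question. The function name toUtf8Path is hypothetical.

```php
<?php
/**
 * Hypothetical helper: return a percent-decoded path as UTF-8.
 * Assumption: any path that is not valid UTF-8 is ISO-8859-1.
 */
function toUtf8Path(string $path): string
{
    // An empty pattern with the /u modifier matches only if the subject
    // is valid UTF-8; on invalid UTF-8 preg_match() returns false.
    if (preg_match('//u', $path) === 1) {
        return $path;
    }

    $converted = iconv('ISO-8859-1', 'UTF-8', $path);
    return $converted === false ? $path : $converted;
}

$fromWindows = toUtf8Path(rawurldecode('%FC'));     // Latin-1 byte, gets converted
$fromMac     = toUtf8Path(rawurldecode('u%CC%88')); // already UTF-8, left alone

var_dump(bin2hex($fromWindows)); // string(4) "c3bc"
var_dump(bin2hex($fromMac));     // string(6) "75cc88"
```

A Latin-1 byte sequence that happens to be valid UTF-8 would slip through unconverted, so this check is a heuristic rather than real detection.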

Evert answered May 26 '26 14:05


