Wordpress/Apache - 404 error with unicode characters in image filenames

We've recently moved a website to a new server, and are running into an odd issue where some uploaded images with unicode characters in the filename are giving us a 404 error.

Via ssh/FTP, we can see that the files are definitely there.

For example, none of the images on this page are working:

http://sjofasting.no/project/adnoy

Code:

<img class='image-display' title='' src='http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg' width='685' height='484'/>

SSH:

-rw-r--r-- 1 xxxxxxxx xxxxxxxx 836813 Aug 3 16:12 ådnøy_1_2.jpg

What is also strange is that if you navigate to the directory you can even click on the image and it works:

http://sjofasting.no/wp/wp-content/uploads/2012/03/

click on 'ådnøy_1_2.jpg' and it works.

Somehow wordpress is generating

http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg

and copying from the direct folder browse is generating

http://sjofasting.no/wp/wp-content/uploads/2012/03/a%CC%8Adn%C3%B8y_1_2.jpg

What is going on??


edit:

If I copy the image url from the wordpress source I get:

http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg

When copied from the Apache directory listing I get:

http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%cc%8ard-12.jpg

What could account for this discrepancy between %C3%A5 and a%cc%8a?

waffl asked Aug 30 '12 15:08


1 Answer

Unicode normalisation.

0xC3 0xA5 is the UTF-8 encoding for U+00E5 a-with-ring.

0xCC 0x8A is the UTF-8 encoding for U+030A combining ring.

U+00E5 is the composed (Normal Form C) way of writing an a-ring; an a letter followed by U+030A is the decomposed (Normal Form D) way of writing it. å (composed) vs å (decomposed): they should look the same, though they may differ slightly depending on font rendering.
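To see the two forms side by side, here is a minimal Python sketch using the standard `unicodedata` module. The variable names are just for illustration.

```python
import unicodedata

composed = "\u00e5"      # U+00E5: a-with-ring as a single code point (NFC)
decomposed = "a\u030a"   # 'a' followed by U+030A combining ring (NFD)

# The UTF-8 bytes match the percent-escapes seen in the two URLs.
print(composed.encode("utf-8"))    # b'\xc3\xa5'
print(decomposed.encode("utf-8"))  # b'a\xcc\x8a'

# They render alike but are different character sequences.
print(composed == decomposed)      # False

# Normalising converts one form into the other.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```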

Now normally it doesn't really matter which one you've got because sensible filesystems leave them untouched. If you save a file called [char U+00E5].txt (å.txt), it stays called that under Windows and Linux.

Macs, on the other hand, are insane. The filesystem prefers Normal Form D, to the extent that any composed characters you pass into it get converted into decomposed ones. If you put a file in called [char U+00E5].txt and immediately list the directory, you'll find you've actually got a file called a[char U+030A].txt. You can still access the file as [char U+00E5].txt on a Mac because it'll convert that input into Normal Form D too before looking it up, but you cannot recover the same filename in character sequence terms as you put in: it's a lossy conversion.

So if you save your files on a Mac and then transfer to a filesystem where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
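One way to spot affected files is to look for names that change when normalised to NFC. The sketch below assumes you already have the filenames as a list of strings (for instance from `os.listdir` on the uploads directory); `non_nfc_names` is a hypothetical helper, not part of WordPress or Apache.

```python
import unicodedata

def non_nfc_names(names):
    """Return the names that are not already in Normal Form C."""
    return [n for n in names if unicodedata.normalize("NFC", n) != n]

# Sample names: the first is decomposed (as a Mac would store it),
# the other two are unaffected by NFC normalisation.
names = ["a\u030adn\u00f8y_1_2.jpg", "plain.jpg", "\u00e5.txt"]
suspects = non_nfc_names(names)

for name in suspects:
    # ascii() makes the combining character visible as an escape.
    print(ascii(name))
```

Any name this reports was probably stored in decomposed form, so links written with composed characters will not match it on a filesystem that leaves names untouched.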

Update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a filesystem that doesn't egregiously mangle Unicode characters.
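If you go the route of updating the URLs, something like the following sketch could convert a link to its Normal Form D equivalent. `to_nfd_url` is a hypothetical helper; it assumes the path is percent-encoded UTF-8 and contains no encoded slashes (`%2F`), which would be decoded and left unescaped.

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def to_nfd_url(url):
    """Rewrite the path of a URL so its characters are in NFD."""
    parts = urlsplit(url)
    path = unquote(parts.path)                 # percent-decode as UTF-8
    path = unicodedata.normalize("NFD", path)  # decompose characters
    return urlunsplit(parts._replace(path=quote(path)))  # re-encode

wp_url = ("http://sjofasting.no/wp/wp-content/uploads/2011/11/"
          "Bore-Strand-Hotellg%C3%A5rd-12.jpg")
nfd_url = to_nfd_url(wp_url)
print(nfd_url)
```

For the example URL from the question, this produces the `a%CC%8A` form that the directory listing links to.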

Think Different, Cause Bizarre Interoperability Problems.

bobince answered Nov 07 '22 04:11