
How to escape Chinese Unicode characters in URL?

I have Chinese users of my PHP web application who enter products into our system. The information they're entering includes, for example, a product title and a price.

We would like to use the product title to generate a nice URL slug for those products. It seems we cannot just use Chinese characters in HREF attributes.

Does anyone know how to handle a title like “婴儿服饰” so that we can generate a clean URL like http://www.site.com/婴儿服饰?

Everything works fine for “normal” languages, but languages whose characters need multi-byte UTF-8 sequences give us problems.

Also, when generating the clean URL, we want to keep SEO in mind, but I have no experience with Chinese in that matter.

Jorre asked May 27 '11 13:05


2 Answers

If your string is already UTF-8, just use rawurlencode to encode the string properly:

$path = '婴儿服饰';
$url = 'http://example.com/'.rawurlencode($path);

UTF-8 is the preferred character encoding for non-ASCII characters in URIs. Only ASCII characters are allowed in the URI itself, which is why the UTF-8 bytes have to be percent-encoded. The result is the same as in tchrist’s example:

http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
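As a rough sketch of how this could fit into the application, the snippet below wraps rawurlencode in a small helper, escapes the result for use in an HREF attribute, and decodes the path again on the way back in. The function name productUrl and the example.com base URL are made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the answer): builds a product URL
// from a UTF-8 title by percent-encoding the title as one path segment.
function productUrl(string $baseUrl, string $title): string
{
    // rawurlencode() percent-encodes every byte outside the
    // RFC 3986 unreserved set (letters, digits, "-", ".", "_", "~").
    return rtrim($baseUrl, '/') . '/' . rawurlencode($title);
}

$title = '婴儿服饰';
$url   = productUrl('http://example.com/', $title);
echo $url, PHP_EOL;

// Escape both the URL and the link text before writing them into HTML.
$link = '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
      . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '</a>';
echo $link, PHP_EOL;

// On the receiving side, the path segment decodes back to the title.
$path    = parse_url($url, PHP_URL_PATH);
$decoded = rawurldecode(ltrim($path, '/'));
var_dump($decoded === $title);
```

Browsers generally display such a URL with the Chinese characters in the address bar, while the bytes sent over the wire stay pure ASCII.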
Gumbo answered Sep 21 '22 21:09


This code, which uses the CPAN module URI::Escape:

#!/usr/bin/env perl

use v5.10;   # enables say
use utf8;    # the source file contains UTF-8 string literals

use URI::Escape qw(uri_escape_utf8);

my $url  = "http://www.site.com/";
my $path = "婴儿服饰";

say $url, uri_escape_utf8($path);

when run, prints:

http://www.site.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0

Is that what you're looking for?

BTW, those four characters are:

CJK UNIFIED IDEOGRAPH-5A74
CJK UNIFIED IDEOGRAPH-513F
CJK UNIFIED IDEOGRAPH-670D
CJK UNIFIED IDEOGRAPH-9970

According to the Unicode::Unihan database, that seems to be read yīng ér fú shì, or perhaps just ying er fú shi per Lingua::ZH::Romanize::Pinyin. It may even be jing¹ jan⁴ fuk⁶ sik¹ or jing˥ jan˨˩ fuk˨ sik˥, using the Cantonese version from Unicode::Unihan.
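For anyone who wants to check the four code points listed above from PHP rather than Perl, here is a minimal sketch. It assumes the mbstring extension is loaded, PHP 7.4 or later (for mb_str_split), and a source file saved as UTF-8; the variable names are only illustrative.

```php
<?php
$title = '婴儿服饰';

$codePoints = [];
// Split the string into characters rather than bytes.
foreach (mb_str_split($title, 1, 'UTF-8') as $char) {
    // mb_ord() returns the Unicode code point as an integer.
    $codePoint = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    // bin2hex() exposes the raw UTF-8 bytes that get percent-encoded in the URL.
    $bytes = strtoupper(implode(' ', str_split(bin2hex($char), 2)));

    $codePoints[] = $codePoint;
    echo "$char  $codePoint  $bytes", PHP_EOL;
}
```

Each character turns out to occupy three UTF-8 bytes, which is why the four-character title becomes twelve %XX triplets in the encoded URL.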

tchrist answered Sep 19 '22 21:09