Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable, SEO-friendly URLs. A slug string typically contains only the characters a-z, 0-9, and -, and can hence be written without URL-escaping (think "foo%20bar").
I'm looking for a Perl slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9, and -).
A super trivial slug function would be something along the lines of:
$input = lc($input);
$input =~ s/[^a-z0-9-]//g;
However, this implementation would not handle internationalization and accents (I want ë to become e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.
My question: is there a well-designed, general way to write such a slug function in Perl?
The slugify filter currently used in Django translates (roughly) to the following Perl code:
use Unicode::Normalize;

sub slugify($) {
    my ($input) = @_;

    $input = NFKD($input);         # Normalize (decompose) the Unicode string
    $input =~ tr/\000-\177//cd;    # Strip non-ASCII characters (>127)
    $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
    $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
    $input = lc($input);           # Lowercase
    $input =~ s/[-\s]+/-/g;        # Replace all runs of spaces and hyphens with a single hyphen
    return $input;
}
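To see how the pieces fit together, here is a self-contained sketch exercising that function on a few sample inputs; it uses only core modules (Unicode::Normalize ships with Perl):

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;

# Django-style slugify, as above
sub slugify($) {
    my ($input) = @_;
    $input = NFKD($input);         # decompose, e.g. é -> e + combining accent
    $input =~ tr/\000-\177//cd;    # strip non-ASCII code points
    $input =~ s/[^\w\s-]//g;       # drop punctuation
    $input =~ s/^\s+|\s+$//g;      # trim
    $input = lc($input);
    $input =~ s/[-\s]+/-/g;        # collapse space/hyphen runs
    return $input;
}

print slugify("Hello, World!"), "\n";   # hello-world
print slugify("liberté"), "\n";         # liberte (only the combining accent is stripped)
print slugify("Foo --- Bar"), "\n";     # foo-bar
```

Note that punctuation and whitespace runs collapse to a single hyphen, while the decomposed accent is dropped silently, leaving the base letter behind.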
Since you also want to change accented characters to unaccented ones, throwing in a call to unidecode (defined in Text::Unidecode) before stripping the non-ASCII characters seems to be your best bet (as pointed out by phaylon). In that case, the function could look like:
use Unicode::Normalize;
use Text::Unidecode;

sub slugify_unidecode($) {
    my ($input) = @_;

    $input = NFC($input);          # Normalize (recompose) the Unicode string
    $input = unidecode($input);    # Convert non-ASCII characters to closest ASCII equivalents
    $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
    $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
    $input = lc($input);           # Lowercase
    $input =~ s/[-\s]+/-/g;        # Replace all runs of spaces and hyphens with a single hyphen
    return $input;
}
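For completeness, the unidecode-based variant can be exercised the same way. Note that Text::Unidecode is a CPAN module, not part of core Perl, so it must be installed first (e.g. via cpanm); the expected outputs below are those from the sample table in this answer:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;
use Text::Unidecode;   # CPAN module; install with: cpanm Text::Unidecode

sub slugify_unidecode($) {
    my ($input) = @_;
    $input = NFC($input);          # recompose first, so é stays one code point
    $input = unidecode($input);    # transliterate to ASCII (北亰 -> "Bei Jing ")
    $input =~ s/[^\w\s-]//g;
    $input =~ s/^\s+|\s+$//g;
    $input = lc($input);
    $input =~ s/[-\s]+/-/g;
    return $input;
}

print slugify_unidecode("北亰"), "\n";     # bei-jing
print slugify_unidecode("liberté"), "\n";  # liberte
```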
The former works well for strings that are primarily ASCII, but falls short when the entire string is formed of non-ASCII characters, since they all get stripped out, leaving you with an empty string.
Sample output:
string       | slugify     | slugify_unidecode
----------------------------------------------
hello world  | hello-world | hello-world
北亰         |             | bei-jing
liberté      | liberte     | liberte
Note how 北亰 gets slugified to nothing (an empty string) with the Django-inspired implementation. Note also the difference the normalization form makes -- liberté becomes 'liberte' with NFKD, which strips out only the second part (the combining accent) of the decomposed character, but would become 'libert' with NFC, which strips out the entire re-composed 'é'.
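That normalization difference is easy to verify with core Unicode::Normalize alone: under NFKD, 'é' is two code points (base letter plus combining accent), so the ASCII strip leaves the base 'e' behind, while under NFC it is a single non-ASCII code point that disappears entirely:

```perl
use strict;
use warnings;
use Unicode::Normalize;

my $word = "libert\x{e9}";            # "liberté" with a precomposed é (U+00E9)

my $decomposed = NFKD($word);         # é -> "e" + U+0301 (combining acute accent)
my $composed   = NFC($decomposed);    # recomposed back to the single code point

printf "NFKD: %d code points\n", length($decomposed);  # 8
printf "NFC:  %d code points\n", length($composed);    # 7

(my $from_nfkd = $decomposed) =~ tr/\000-\177//cd;     # strip non-ASCII
(my $from_nfc  = $composed)   =~ tr/\000-\177//cd;

print "$from_nfkd\n";   # liberte  (only the combining accent was stripped)
print "$from_nfc\n";    # libert   (the whole é was stripped)
```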
Are you looking for something like Text::Unidecode?