
How can I generate URL slugs in Perl?

Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs:

  • Slugs in Rails
  • Slugs in Django

A slug string typically consists only of the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").

I'm looking for a Perl slug function that, given any valid Unicode string, returns a slug representation (a-z, 0-9 and -).

A super trivial slug function would be something along the lines of:

$input = lc($input);
$input =~ s/[^a-z0-9-]//g;

However, this implementation would not handle internationalization and accents (I want ë to become e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.
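
To make the shortcoming concrete, here is a minimal sketch of the trivial approach wrapped in a function (the name naive_slug and the sample titles are only for illustration). It shows that accented letters are dropped rather than transliterated, and that spaces vanish instead of becoming hyphens:

```perl
use strict;
use warnings;

# The trivial approach: lowercase, then drop everything
# outside a-z, 0-9 and hyphen.
sub naive_slug {
    my ($input) = @_;
    $input = lc($input);
    $input =~ s/[^a-z0-9-]//g;
    return $input;
}

binmode(STDOUT, ':encoding(UTF-8)');

# \x{e9} is "é"; escapes are used so the source file stays pure ASCII.
for my $title ("Hello World", "Libert\x{e9}, \x{e9}galit\x{e9}") {
    printf "%-22s => %s\n", $title, naive_slug($title);
}
```

The second title comes out as "libertgalit": every é is lost, and so is the word boundary.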

My question:

  • What is the most general/practical way to generate Django/Rails type slugs in Perl? This is how I solved the same problem in Java.
knorv asked Oct 24 '10 16:10



2 Answers

The slugify filter currently used in Django translates (roughly) to the following Perl code:

use Unicode::Normalize;

sub slugify($) {
    my ($input) = @_;

    $input = NFKD($input);         # Normalize (decompose) the Unicode string
    $input =~ tr/\000-\177//cd;    # Strip non-ASCII characters (>127)
    $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
    $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
    $input = lc($input);
    $input =~ s/[-\s]+/-/g;        # Replace all occurrences of spaces and hyphens with a single hyphen

    return $input;
}

Since you also want to change accented characters to unaccented ones, throwing in a call to unidecode (defined in Text::Unidecode) before stripping the non-ASCII characters seems to be your best bet (as pointed out by phaylon).

In that case, the function could look like:

use Unicode::Normalize;
use Text::Unidecode;

sub slugify_unidecode($) {
    my ($input) = @_;

    $input = NFC($input);          # Normalize (recompose) the Unicode string
    $input = unidecode($input);    # Convert non-ASCII characters to closest equivalents
    $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
    $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
    $input = lc($input);
    $input =~ s/[-\s]+/-/g;        # Replace all occurrences of spaces and hyphens with a single hyphen

    return $input;
}

The former works well for strings that are primarily ASCII, but falls short when the entire string is formed of non-ASCII characters, since they all get stripped out, leaving you with an empty string.

Sample output:

string        | slugify       | slugify_unidecode
-------------------------------------------------
hello world     hello-world     hello-world
北亰                            bei-jing
liberté         liberte         liberte

Note how 北亰 gets slugified to nothing with the Django-inspired implementation. Note also the difference the choice of normalization makes when unidecode is not used: with NFKD, 'é' decomposes into 'e' plus a combining accent, so stripping the non-ASCII part leaves 'liberte', whereas with NFC the re-assembled 'é' would be stripped as a whole, leaving 'libert'. If slugify gives you 'liberta' instead, the input was most likely passed in as undecoded UTF-8 bytes rather than as a character string.
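
The results for accented input depend on the string being a decoded character string. Below is a minimal sketch, assuming the input arrives as raw UTF-8 bytes and using the core Encode module (the variable names are illustrative), that runs the Django-style function on the same word before and after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFKD);

# Same steps as the Django-style slugify above.
sub slugify {
    my ($input) = @_;
    $input = NFKD($input);          # Decompose: "é" becomes "e" + combining accent
    $input =~ tr/\000-\177//cd;     # Delete everything outside ASCII
    $input =~ s/[^\w\s-]//g;        # Keep word characters, whitespace, hyphens
    $input =~ s/^\s+|\s+$//g;       # Trim both ends
    $input = lc($input);
    $input =~ s/[-\s]+/-/g;         # Collapse runs of spaces/hyphens to one hyphen
    return $input;
}

# "liberté" as raw UTF-8 bytes, e.g. read from a file or socket
# without an encoding layer.
my $bytes = "libert\xC3\xA9";
my $chars = decode('UTF-8', $bytes);    # character string, 7 characters

print slugify($chars), "\n";            # liberte
print slugify($bytes), "\n";            # liberta (bytes seen as "Ã" and "©")
print slugify("hello world"), "\n";     # hello-world
```

Decoding at the boundary (for example with an :encoding(UTF-8) layer on the input handle) keeps the slug function itself free of encoding concerns.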

Cameron answered Nov 07 '22 03:11


Are you looking for something like Text::Unidecode?

phaylon answered Nov 07 '22 02:11