Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String slugification in Python

Tags:

python

slug

There is a python package named python-slugify, which does a pretty good job of slugifying:

pip install python-slugify

Works like this:

from slugify import slugify

txt = "This is a test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")

txt = "This -- is a ## test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")

txt = 'C\'est déjà l\'été.'
r = slugify(txt)
self.assertEquals(r, "cest-deja-lete")

txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
self.assertEquals(r, "nin-hao-wo-shi-zhong-guo-ren")

txt = 'Компьютер'
r = slugify(txt)
self.assertEquals(r, "kompiuter")

txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
self.assertEquals(r, "jaja-lol-mememeoo-a")

See More examples

This package does a bit more than what you posted (take a look at the source, it's just one file). The project is still active (got updated 2 days before I originally answered, over seven years later (last checked 2020-06-30), it still gets updated).

careful: There is a second package around, named slugify. If you have both of them, you might get a problem, as they have the same name for import. The one just named slugify didn't do all I quick-checked: "Ich heiße" became "ich-heie" (should be "ich-heisse"), so be sure to pick the right one, when using pip or easy_install.


Install unidecode form from here for unicode support

pip install unidecode

# -*- coding: utf-8 -*-
import re
import unidecode

def slugify(text):
    text = unidecode.unidecode(text).lower()
    return re.sub(r'[\W_]+', '-', text)

text = u"My custom хелло ворлд"
print slugify(text)

>>> my-custom-khello-vorld


There is python package named awesome-slugify:

pip install awesome-slugify

Works like this:

from slugify import slugify

slugify('one kožušček')  # one-kozuscek

awesome-slugify github page


It works well in Django, so I don't see why it wouldn't be a good general purpose slugify function.

Are you having any problems with it?


The problem is with the ascii normalization line:

slug = unicodedata.normalize('NFKD', s)

It is called unicode normalization which does not decompose lots of characters to ascii. For example, it would strip non-ascii characters from the following strings:

Mørdag -> mrdag
Æther -> ther

A better way to do it is to use the unidecode module that tries to transliterate strings to ascii. So if you replace the above line with:

import unidecode
slug = unidecode.unidecode(s)

You get better results for the above strings and for many Greek and Russian characters too:

Mørdag -> mordag
Æther -> aether

def slugify(value):
    """
    Converts to lowercase, removes non-word characters (alphanumerics and
    underscores) and converts spaces to hyphens. Also strips leading and
    trailing whitespace.
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub('[^\w\s-]', '', value).strip().lower()
    return mark_safe(re.sub('[-\s]+', '-', value))
slugify = allow_lazy(slugify, six.text_type)

This is the slugify function present in django.utils.text This should suffice your requirement.