I have a string that I want to use as a filename, so I want to remove all characters that wouldn't be allowed in filenames, using Python.
I'd rather be strict than otherwise, so let's say I want to retain only letters, digits, and a small set of other characters like "_-.() "
. What's the most elegant solution?
The filename needs to be valid on multiple operating systems (Windows, Linux and Mac OS) - it's an MP3 file in my library with the song title as the filename, and is shared and backed up between 3 machines.
Don't start or end your filename with a space, period, hyphen, or underline. Keep your filenames to a reasonable length and be sure they are under 31 characters. Most operating systems are case sensitive; always use lowercase. Avoid using spaces and underscores; use a hyphen instead.
The String format() Method.
Python Programming Puzzles: Exercise-59 with Solution A valid filename should end in . txt, .exe, . jpg, . png, or . dll, and should have at most three digits, no additional periods.
Supported characters for a file name are letters, numbers, spaces, and ( ) _ - , . *Please note file names should be limited to 100 characters. Characters that are NOT supported include, but are not limited to: @ $ % & \ / : * ?
You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename- friendly.
The Django text utils define a function, slugify()
, that's probably the gold standard for this kind of thing. Essentially, their code is the following.
import unicodedata import re def slugify(value, allow_unicode=False): """ Taken from https://github.com/django/django/blob/master/django/utils/text.py Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren't alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores. """ value = str(value) if allow_unicode: value = unicodedata.normalize('NFKC', value) else: value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii') value = re.sub(r'[^\w\s-]', '', value.lower()) return re.sub(r'[-\s]+', '-', value).strip('-_')
And the older version:
def slugify(value): """ Normalizes string, converts to lowercase, removes non-alpha characters, and converts spaces to hyphens. """ import unicodedata value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore') value = unicode(re.sub('[^\w\s-]', '', value).strip().lower()) value = unicode(re.sub('[-\s]+', '-', value)) # ... return value
There's more, but I left it out, since it doesn't address slugification, but escaping.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With