Description. Removes special characters that are illegal in filenames on certain operating systems and special characters requiring special escaping to manipulate at the command line.
Supported characters for a file name are letters, numbers, spaces, and ( ) _ - , . *Please note file names should be limited to 100 characters. Characters that are NOT supported include, but are not limited to: @ $ % & \ / : * ? " ' < > | ~ ` # ^ + = { } [ ] ; !
Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:
// Remove anything which isn't a word, whitespace, number
// or any of the following caracters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);
This is how you can sanitize for a file system as asked
function filter_filename($name) {
// remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
$name = str_replace(array_merge(
array_map('chr', range(0, 31)),
array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
), '', $name);
// maximise filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($name, PATHINFO_EXTENSION);
$name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
return $name;
}
Everything else is allowed in a filesystem, so the question is perfectly answered...
... but it could be dangerous to allow for example single quotes '
in a filename if you use it later in an unsafe HTML context because this absolutely legal filename:
' onerror= 'alert(document.cookie).jpg
becomes an XSS hole:
<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie)' />
Because of that, the popular CMS software Wordpress removes them, but they covered all relevant chars only after some updates:
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few rows later are whitespaces removed as well ...
preg_replace( '/[\r\n\t -]+/', '-', $filename )
Finally their list includes now most of the characters that are part of the URI rerserved-characters and URL unsafe characters list.
Of course you could simply encode all these chars on HTML output, but most developers and me too, follow the idiom "Better safe than sorry" and delete them in advance.
So finally I would suggest to use this:
function filter_filename($filename, $beautify=true) {
// sanitize filename
$filename = preg_replace(
'~
[<>:"/\\|?*]| # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
[\x00-\x1F]| # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
[\x7F\xA0\xAD]| # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
[#\[\]@!$&\'()+,;=]| # URI reserved https://www.rfc-editor.org/rfc/rfc3986#section-2.2
[{}^\~`] # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
~x',
'-', $filename);
// avoids ".", ".." or ".hiddenFiles"
$filename = ltrim($filename, '.-');
// optional beautification
if ($beautify) $filename = beautify_filename($filename);
// maximize filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($filename, PATHINFO_EXTENSION);
$filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
return $filename;
}
Everything else that does not cause problems with the file system should be part of an additional function:
function beautify_filename($filename) {
// reduce consecutive characters
$filename = preg_replace(array(
// "file name.zip" becomes "file-name.zip"
'/ +/',
// "file___name.zip" becomes "file-name.zip"
'/_+/',
// "file---name.zip" becomes "file-name.zip"
'/-+/'
), '-', $filename);
$filename = preg_replace(array(
// "file--.--.-.--name.zip" becomes "file.name.zip"
'/-*\.-*/',
// "file...name..zip" becomes "file.name.zip"
'/\.{2,}/'
), '.', $filename);
// lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
$filename = mb_strtolower($filename, mb_detect_encoding($filename));
// ".file-name.-" becomes "file-name"
$filename = trim($filename, '.-');
return $filename;
}
And at this point you need to generate a filename if the result is empty and you can decide if you want to encode UTF-8 characters. But you do not need that as UTF-8 is allowed in all file systems that are used in web hosting contexts.
The only thing you have to do is to use urlencode()
(as you hopefully do it with all your URLs) so the filename საბეჭდი_მანქანა.jpg
becomes this URL as your <img src>
or <a href>
:
http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg
Stackoverflow does that, so I can post this link as a user would do it:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg
So this is a complete legal filename and not a problem as @SequenceDigitale.com mentioned in his answer.
SOLUTION 1 - simple and effective
$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );
[^a-z0-9]+
will ensure, the filename only keeps letters and numbers'-'
keeps the filename readableExample:
URL: http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename
SOLUTION 2 - for very long URLs
You want to cache the URL contents and just need to have unique filenames. I would use this function:
$file_name = md5( strtolower( $url ) )
this will create a filename with fixed length. The MD5 hash is in most cases unique enough for this kind of usage.
Example:
URL: https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
What about using rawurlencode() ? http://www.php.net/manual/en/function.rawurlencode.php
Here is a function that sanitize even Chinese Chars:
public static function normalizeString ($str = '')
{
$str = strip_tags($str);
$str = preg_replace('/[\r\n\t ]+/', ' ', $str);
$str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
$str = strtolower($str);
$str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
$str = htmlentities($str, ENT_QUOTES, "utf-8");
$str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
$str = str_replace(' ', '-', $str);
$str = rawurlencode($str);
$str = str_replace('%', '-', $str);
return $str;
}
Here is the explaination
OK, some filename will not be releavant but in most case it will work.
ex. Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"
Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"
It's better like that than an 404 error.
Hope that was helpful.
Carl.
Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z
, 0-9
, _
, and a single instance of a period (.
). That's obviously more limiting than most filesystems, but should keep you safe.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With