URL sanitization means exactly what you think it means. URL clean up. But why would a URL need cleaning up? Doesn't it mean that we won't arrive to the intended website if we cut some parts of the URL? Let me explain.
I found this larger function in the Chyrp code:
/**
* Function: sanitize
* Returns a sanitized string, typically for URLs.
*
* Parameters:
* $string - The string to sanitize.
* $force_lowercase - Force the string to lowercase?
* $anal - If set to *true*, will remove all non-alphanumeric characters.
*/
function sanitize($string, $force_lowercase = true, $anal = false) {
$strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
"}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
"—", "–", ",", "<", ".", ">", "/", "?");
$clean = trim(str_replace($strip, "", strip_tags($string)));
$clean = preg_replace('/\s+/', "-", $clean);
$clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
return ($force_lowercase) ?
(function_exists('mb_strtolower')) ?
mb_strtolower($clean, 'UTF-8') :
strtolower($clean) :
$clean;
}
and this one in the wordpress code
/**
* Sanitizes a filename replacing whitespace with dashes
*
* Removes special characters that are illegal in filenames on certain
* operating systems and special characters requiring special escaping
* to manipulate at the command line. Replaces spaces and consecutive
* dashes with a single dash. Trim period, dash and underscore from beginning
* and end of filename.
*
* @since 2.1.0
*
* @param string $filename The filename to be sanitized
* @return string The sanitized filename
*/
function sanitize_file_name( $filename ) {
$filename_raw = $filename;
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
$special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
$filename = str_replace($special_chars, '', $filename);
$filename = preg_replace('/[\s-]+/', '-', $filename);
$filename = trim($filename, '.-_');
return apply_filters('sanitize_file_name', $filename, $filename_raw);
}
Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.
Some observations on your solution:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.
So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.
The Encoder interface provides:
canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)
https://github.com/OWASP/PHP-ESAPI https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
This should make your filenames safe...
$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);
and a deeper solution to this is:
// Remove special accented characters - ie. sí.
$clean_name = strtr($string, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));
$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);
This assumes that you want a dot in the filename. if you want it transferred to lowercase, just use
$clean_name = strtolower($clean_name);
for the last line.
Try this:
function normal_chars($string)
{
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
$string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);
return trim($string, ' -');
}
Examples:
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA
Based on the selected answer in this thread: URL Friendly Username in PHP?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With