I am trying to process the names of the files my users upload. I want to support all valid UTF-8 characters except those that might pose a problem for display on an HTML webpage, access over a CLI interface, or storage and retrieval on a filesystem.
I came up with the following lenient function and I'm wondering if it's safe enough to use. I use prepared statements for all database queries and I always HTML-encode my output, but I would still like to know whether this is a well-thought-out approach.
// $filename = $_FILES['file']['name'];
$filename = 'Filename 123;".\'"."la\l[a]*(/.jpg
∮ E⋅da = Q, n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),
ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B),
2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm
sfajs,-=[];\',./09μετράει
าวนั้นเป็นชน
Καλημέρα κόσμε, コンニチハ
()_+{}|":?><';
// Replace symbols, punctuation, and ASCII control characters like \n or [BEL]
$filename = preg_replace('~[\p{S}\p{P}\p{C}]+~u', ' ', $filename);
Is this approach safe for me, and suitable for my users?
To clarify, I do not use the filename for the name of the file on the filesystem. I generate a unique hash and use that - I just need to save the original name for the users' benefit, since that is how they recognize their files. A SHA1 hash or UUID doesn't mean a thing to them.
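As a rough sketch of that split between the name on disk and the name shown to users (the function name `prepare_upload_names` and the array keys are made up for illustration, and `random_bytes` stands in for whatever hash or UUID scheme is actually used):

```php
<?php
// Hypothetical helper: returns a random storage name and a cleaned display name.
function prepare_upload_names(string $originalName): array
{
    // Random, filesystem-safe storage name: 16 random bytes as 32 hex characters.
    $storedName = bin2hex(random_bytes(16));

    // Display name: same cleanup as the preg_replace in the question.
    // preg_replace returns null if the input is not valid UTF-8, hence the cast.
    $displayName = preg_replace('~[\p{S}\p{P}\p{C}]+~u', ' ', $originalName);
    $displayName = trim((string) $displayName);

    return ['stored' => $storedName, 'display' => $displayName];
}
```

Only the `stored` value would ever touch the filesystem; the `display` value would go into the database and be HTML-encoded on output.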
UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 Unicode code points (numbered 0-127) are encoded as single bytes identical to their ASCII values, meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes.
Valid UTF-8 has a specific binary format. A single-byte UTF-8 character always has the form '0xxxxxxx', where 'x' is any binary digit. A two-byte UTF-8 character always has the form '110xxxxx 10xxxxxx'.
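To make the bit patterns concrete, here is a small sketch that reads the length of a sequence from its first byte. The name `utf8_sequence_length` is hypothetical, and the three- and four-byte lead patterns (`1110xxxx` and `11110xxx`) are added here beyond what the text lists. It only inspects the leading bits, so it is not a full validity check (it does not reject overlong forms, for example).

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with $byte,
// or 0 if $byte cannot start a sequence (e.g. a continuation byte 10xxxxxx).
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}
```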
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
The very first thing you need to do is to check that your input is valid UTF-8.
mb_internal_encoding and mb_check_encoding are your friends.
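A minimal sketch of that check, assuming the mbstring extension is loaded (`require_utf8` is a made-up name, and returning null on bad input is just one possible policy):

```php
<?php
// Hypothetical helper: returns the name unchanged if it is valid UTF-8, otherwise null.
function require_utf8(string $name): ?string
{
    return mb_check_encoding($name, 'UTF-8') ? $name : null;
}
```

Calling `mb_internal_encoding('UTF-8')` once at startup makes the other mb_* functions default to UTF-8 as well. Passing the encoding explicitly, as above, avoids depending on that setting.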
You are using a blacklist, when it's good security practice to use a whitelist of allowed input.
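For comparison, a whitelist version might look something like this. The allowed set here (letters, combining marks, numbers, and the space character) is an assumption for illustration, not a recommendation; adjust it to whatever your users actually need.

```php
<?php
// Hypothetical whitelist filter: everything that is NOT a letter (L), combining mark (M),
// number (N) or a plain space is replaced with a space.
function whitelist_filename(string $name): string
{
    $clean = preg_replace('~[^\p{L}\p{M}\p{N} ]+~u', ' ', $name);

    // preg_replace returns null when $name is not valid UTF-8.
    return $clean === null ? '' : trim($clean);
}
```

The practical difference from the blacklist is that any Unicode category you did not think about is rejected by default instead of allowed by default.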
Edit after the clarification:
You should be safe. Remember to filter Lm and No as well if you don't want to summon Zalgo.
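A sketch of what that could look like, with one addition that is my own assumption: the stacked diacritics in Zalgo text are combining marks (category M), which neither the original pattern nor Lm/No covers. Stripping all of M would damage scripts such as Thai that rely on combining marks, so this hypothetical `strip_extra_categories` only caps runs of marks at two, an arbitrary limit.

```php
<?php
// Hypothetical variant of the filter from the question.
function strip_extra_categories(string $name): string
{
    // Original classes plus Lm (modifier letters) and No (other numbers such as ² or ½).
    $clean = preg_replace('~[\p{S}\p{P}\p{C}\p{Lm}\p{No}]+~u', ' ', $name);
    if ($clean === null) {
        return ''; // input was not valid UTF-8
    }

    // Assumption: keep at most two combining marks in a row and drop the rest of the run.
    $clean = preg_replace('~(\p{M}{2})\p{M}+~u', '$1', $clean);

    return $clean === null ? '' : $clean;
}
```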