
Regex for UTF-8 valid filenames

I am trying to process the names of the files my users upload. I want to support all valid UTF-8 characters except those that might cause problems when the name is displayed on an HTML page, used on the command line, or stored on and retrieved from a filesystem.

I came up with the following lenient function, and I'm wondering whether it is safe enough to use. I use prepared statements for all database queries and I always HTML-encode my output, but I would still like to know that this is a well-thought-out approach.

// $filename = $_FILES['file']['name'];

$filename = 'Filename 123;".\'"."la\l[a]*(/.jpg
∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),
  ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B),
  2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm
sfajs,-=[];\',./09μετράει
าวนั้นเป็นชน
Καλημέρα κόσμε, コンニチハ
()_+{}|":?><';


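// Replace runs of symbols (\p{S}), punctuation (\p{P}), and "other" characters
// (\p{C}: control characters like \n or [BEL], plus format and unassigned code points) with a space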
$filename = preg_replace('~[\p{S}\p{P}\p{C}]+~u', ' ', $filename);

Is this approach safe for me, and suitable for my users?

Update

To clarify, I do not use the filename for the name of the file on the filesystem. I generate a unique hash and use that. I just need to save the original name for the users' benefit, since that is how they recognize their files. A SHA1 hash or UUID doesn't mean a thing to them.
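
A sketch of that arrangement, under stated assumptions: the helper name `make_upload_record` and the array keys are hypothetical, and `random_bytes()` stands in for whatever hash or UUID scheme is actually used.

```php
<?php
// Hypothetical helper: pair a random storage name with the original display name.
function make_upload_record(string $originalName): array
{
    return [
        // Assumption: 16 random bytes as hex instead of a SHA1 hash or UUID.
        'stored_as' => bin2hex(random_bytes(16)), // 32 hex characters
        'display'   => $originalName,             // kept only to show to the user
    ];
}

$record = make_upload_record('Καλημέρα <b>.jpg');

// The original name never touches the filesystem; it is HTML-encoded on output.
echo htmlspecialchars($record['display'], ENT_QUOTES, 'UTF-8'), "\n";
```

With this split, the display name only has to be safe for the database and for HTML output, not for the filesystem.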

asked Aug 14 '12 18:08 by Xeoncross



1 Answer

The very first thing you need to do is to check that your input is valid UTF-8.

mb_internal_encoding and mb_check_encoding are your friends.
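
A minimal sketch of that check, assuming the mbstring extension is loaded; the helper name `is_valid_utf8_name` is hypothetical:

```php
<?php
// Hypothetical helper: true only if $name is a well-formed UTF-8 byte sequence.
function is_valid_utf8_name(string $name): bool
{
    // Passing the encoding explicitly avoids depending on mb_internal_encoding().
    return mb_check_encoding($name, 'UTF-8');
}

$good = "Καλημέρα κόσμε.jpg";
$bad  = "broken\xC3\x28.jpg"; // 0xC3 must be followed by a continuation byte, 0x28 is not one

var_dump(is_valid_utf8_name($good)); // bool(true)
var_dump(is_valid_utf8_name($bad));  // bool(false)
```

The check matters for the regex in the question as well: with the `u` modifier, `preg_replace()` returns `null` when the subject is not valid UTF-8.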

You are using a blacklist, whereas good security practice is to use a whitelist of allowed input.
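
One way a whitelist could look, as a sketch only: the function name `whitelist_filename` and the choice of allowed characters (letters, combining marks, decimal digits, and `. _ -`) are assumptions, not something the question specifies.

```php
<?php
// Hypothetical whitelist: keep letters (\p{L}), combining marks (\p{M}),
// decimal digits (\p{Nd}), and the characters . _ -
// Any run of other characters, spaces included, collapses to one space.
function whitelist_filename(string $name): string
{
    $clean = preg_replace('~[^\p{L}\p{M}\p{Nd}._-]+~u', ' ', $name);

    // preg_replace() returns null on failure, e.g. invalid UTF-8 with the u modifier.
    return $clean === null ? '' : trim($clean);
}

echo whitelist_filename("report <final>*.jpg"), "\n"; // report final .jpg
```

Keeping `\p{M}` in the allowed set preserves scripts such as Thai, whose vowel and tone signs are combining marks.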

Edit after the clarification:

You should be safe. Remember to filter Lm and No as well if you don't want to summon Zalgo.
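
Extending the question's pattern with those two categories might look like the sketch below; the function name `strip_extra_categories` is hypothetical. `\p{Lm}` matches modifier letters and `\p{No}` matches "other" numbers such as superscript and subscript digits. Note that stacked "Zalgo" text is usually built from combining marks (`\p{M}`), which this pattern leaves alone; removing those outright would also damage scripts like Thai.

```php
<?php
// The question's character class plus \p{Lm} (modifier letters) and \p{No} (other numbers).
// Hypothetical helper; returns null if preg_replace() fails (e.g. invalid UTF-8).
function strip_extra_categories(string $name): ?string
{
    return preg_replace('~[\p{S}\p{P}\p{C}\p{Lm}\p{No}]+~u', ' ', $name);
}

echo strip_extra_categories("H₂O"), "\n"; // H O   (U+2082 SUBSCRIPT TWO is in No)
echo strip_extra_categories("aʰb"), "\n"; // a b   (U+02B0 MODIFIER LETTER SMALL H is in Lm)
```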

answered Sep 22 '22 15:09 by InternetSeriousBusiness