I've run into a few problems using a C# regex to implement a whitelist of allowed characters on web inputs. I am trying to avoid SQL injection and XSS attacks. I've read that whitelists of the allowable characters are the way to go.
The inputs are people names and company names.
Some of the problems are:
Company names that contain ampersands, like "Jim & Sons". The ampersand is important, but it is risky.
Unicode characters in names: we have Asian customers, for example, who enter their names using their own character sets. I need to whitelist all of these.
I find myself wanting to allow almost every character after seeing all the data that is in the DB already (and being entered by new users).
Any suggestions for a good whitelist that will handle these (and other) issues?
NOTE: It's a legacy system, so I don't have control of all the code. I was hoping to reduce the number of attacks by preventing bad data from getting into the system in the first place.
This SO thread has a lot of good discussion on protecting yourself from injection attacks.
In short:
In your case, you can limit the name field to a small character set. The company field will be more difficult, and you need to balance your users' need for freedom of entry against your need for site security. As others have said, writing your own custom sanitization methods is tricky and risky. Keep it simple and protect yourself through your architecture - don't simply rely on strings being "safe", even after sanitization.
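As a rough illustration of what "protecting yourself through your architecture" can look like, here is a minimal sketch that binds the value as a query parameter on the way into the database and HTML-encodes it on the way out to the page. The class and method names, and the Companies table with its Name column, are made up for the example. The @name placeholder syntax is the SQL Server style, so other providers may need a different marker.

```csharp
using System;
using System.Data;
using System.Net;

public static class SafeHandling
{
    // Stores the name through a bound parameter, so the value is never
    // concatenated into the SQL text. "Companies" and "Name" are
    // hypothetical table and column names.
    public static void InsertCompany(IDbConnection connection, string companyName)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = "INSERT INTO Companies (Name) VALUES (@name)";

            IDbDataParameter parameter = command.CreateParameter();
            parameter.ParameterName = "@name";
            parameter.Value = companyName;
            command.Parameters.Add(parameter);

            command.ExecuteNonQuery();
        }
    }

    // Encodes the stored value at output time, so characters such as
    // & and < are rendered as text instead of being parsed as markup.
    public static string ToHtml(string companyName)
    {
        return WebUtility.HtmlEncode(companyName);
    }

    public static void Main()
    {
        Console.WriteLine(ToHtml("Jim & Sons")); // prints "Jim &amp; Sons"
    }
}
```

With both of these in place, an ampersand in a company name stops being a risk in itself: it is stored as data and displayed as text.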
EDIT:
To clarify - if you're trying to develop a whitelist, it's not something that the community can hand out, since it's entirely dependent on the data you want. But let's look at an example of a regex whitelist, perhaps for names. Say I've whitelisted A-Z, a-z, and space.
// requires: using System.Text.RegularExpressions;
Regex reWhiteList = new Regex("^[A-Za-z ]+$");
That checks to see if the entire string is composed of those characters. Note that a string with a number, a period, a quote, or anything else would NOT match this regex and thus would fail the whitelist.
if (reWhiteList.IsMatch(strInput))
{
    // it's ok, proceed to step 2
}
else
{
    // it's not ok, inform the user they've entered invalid characters and ask them to try again
}
Hopefully this helps some more! With names and company names you'll have a tough-to-impossible time developing a rigorous pattern to check against, but you can do a simple allowable character list, as I showed here.
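For the Unicode names and the "Jim & Sons" case from the question, the same approach can be extended with Unicode categories instead of A-Z. The sketch below is only one possible starting point: the class name and the exact punctuation set are assumptions to adjust against your real data. It accepts any letter (\p{L}), combining mark (\p{M}), decimal digit (\p{Nd}), space, and a handful of punctuation characters. Because it deliberately lets through & and ', passing this check does not make a string safe to concatenate into SQL or HTML.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameWhitelist
{
    // \p{L}  = any Unicode letter (Latin, CJK, Cyrillic, ...)
    // \p{M}  = combining marks (accents stored as separate code points)
    // \p{Nd} = decimal digits
    // The trailing '-' inside the class is a literal hyphen.
    // \z anchors at the very end of the input; unlike $, it does not
    // match before a trailing newline.
    private static readonly Regex Allowed =
        new Regex(@"^[\p{L}\p{M}\p{Nd} .,'&-]+\z", RegexOptions.CultureInvariant);

    public static bool IsAllowed(string input)
    {
        return input != null && Allowed.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsAllowed("Jim & Sons"));   // True
        Console.WriteLine(IsAllowed("山田太郎"));       // True
        Console.WriteLine(IsAllowed("O'Brien"));      // True
        Console.WriteLine(IsAllowed("<script>"));     // False
    }
}
```

A check like this mainly filters out characters that have no business in a name, such as angle brackets, semicolons, and parentheses. It is a first filter, not a replacement for handling the data safely further down the line.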