
What's the best way to parse a string for "bad" words in C#?

I'm thinking of something like:

foreach (var word in paragraph.Split(' ')) {
  if (badWordArray.Contains(word)) {
    // do something about it
  }
}

but I'm sure there's a better way.

Thanks in advance!

UPDATE: I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto flagging system of sorts.

Chaddeus asked Jul 09 '10 03:07


3 Answers

While your way works, it may be a bit time-consuming. There is a wonderful response here for a previous SO question. Though that question is about PHP rather than C#, I think it can be easily ported.

Edit to add sample code:

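// Requires: using System.Text.RegularExpressions;
public string FilterWords(string inputWords) {
    // \b anchors the alternatives to word boundaries, so "crabs" does not match inside "scrabs".
    Regex wordFilter = new Regex(@"\b(puppies|kittens|dolphins|crabs)\b", RegexOptions.IgnoreCase);
    return wordFilter.Replace(inputWords, "<3");
}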

That should work for you, more or less.

Edit to answer OP clarification:

I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.

Much like the replacement code above, you can check whether anything matches like so:

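// Requires: using System.Text.RegularExpressions;
public bool HasBadWords(string inputWords) {
    // \b anchors the alternatives to word boundaries, so only whole words are flagged.
    Regex wordFilter = new Regex(@"\b(puppies|kittens|dolphins|crabs)\b", RegexOptions.IgnoreCase);
    return wordFilter.IsMatch(inputWords);
}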

It will return true if the string you passed to it contains any words in the list.
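For a flagging system it may help to know which words matched, not just whether something did. Here is a minimal sketch under a few assumptions: the class and method names (BadWordMatcher, BuildFilter, FindMatches) are made up for illustration, the word list is supplied by the caller, and whole-word, case-insensitive matching is what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class BadWordMatcher
{
    // Builds one alternation regex from the word list.
    // Regex.Escape keeps any special characters in a word from being read as regex syntax.
    public static Regex BuildFilter(IEnumerable<string> words)
    {
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(" + alternation + @")\b", RegexOptions.IgnoreCase);
    }

    // Returns every matched word, in the order it appears in the input.
    public static List<string> FindMatches(Regex filter, string input)
    {
        return filter.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

Build the filter once and reuse it, then queue the post for review whenever FindMatches returns a non-empty list.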

rakuo15 answered Sep 21 '22 13:09


At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).

One tip is to pre-process the user input before testing against your list, in case someone is trying to sneak something by you. So by way of preprocessing, we:

  • uppercase everything in the input
  • remove most non-alphanumerics (that is, just splice out any spaces, or punctuation, etc.)
  • and then, assuming someone is trying to pass off digits for letters, do something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative)

And then get some friends to try to break it. It's fun.
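The preprocessing steps above could be sketched roughly like this. It is only an illustration, not the code from that job: the class name InputNormalizer and the method Normalize are hypothetical, and the lookalike table holds only the three substitutions mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; names are illustrative.
public static class InputNormalizer
{
    // Digit-to-letter lookalikes named in the answer; extend as needed.
    private static readonly Dictionary<char, char> Lookalikes = new Dictionary<char, char>
    {
        { '0', 'O' },
        { '9', 'G' },
        { '5', 'S' }
    };

    public static string Normalize(string input)
    {
        var result = new StringBuilder(input.Length);
        // Step 1: uppercase everything.
        foreach (char c in input.ToUpperInvariant())
        {
            // Step 2: drop spaces, punctuation, and anything else that is not a letter or digit.
            if (!char.IsLetterOrDigit(c))
            {
                continue;
            }
            // Step 3: swap digits that stand in for letters.
            char mapped;
            result.Append(Lookalikes.TryGetValue(c, out mapped) ? mapped : c);
        }
        return result.ToString();
    }
}
```

Because spaces are stripped, the normalized text is one long run of characters, so the check against the list has to be a substring test against uppercased entries rather than a whole-word match. That raises the false-positive rate, which is tolerable if a human reviews every flag.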

Detmar answered Sep 21 '22 13:09


You could consider using a HashSet<T> or a Dictionary<TKey, TValue> instead of the array. Lookups then go through .Contains() on the set (or .ContainsKey() on the dictionary), which is a hash lookup rather than a linear scan of the array, so it is way more efficient. This is especially true if you have a large list of profanities (not sure how many there are! :)
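As a rough sketch of the hash-based lookup, assuming a HashSet<string> with a case-insensitive comparer is acceptable: the class name BadWordFlagger, the method FindBadWords, the separator list, and the placeholder words (borrowed from the sample words in another answer) are all illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; names and word list are placeholders.
public static class BadWordFlagger
{
    // Case-insensitive set, so "Puppies" and "puppies" hit the same entry.
    private static readonly HashSet<string> BadWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "puppies", "kittens", "dolphins", "crabs"
        };

    // Characters treated as word separators; an assumption, adjust for your input.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };

    // Returns the distinct flagged words found in the text, in order of first appearance.
    public static List<string> FindBadWords(string text)
    {
        return text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => BadWords.Contains(word))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Each lookup is a hash probe, so the cost per word stays roughly constant as the list grows, whereas Contains on an array scans every entry.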

Alex answered Sep 17 '22 13:09