
How to remove offensive words from post by php?

Tags: html, php

Assume "xyza" is a bad word. I'm using the following method to replace offensive words:

$text = str_replace("xyza","(Offensive words detected & removed!)",$text);

This code replaces xyza with "(Offensive words detected & removed!)".

But the problem is case: if someone types XYZA, my code can't detect it. How do I solve this?

netmaster asked Aug 21 '13 07:08


3 Answers

No matter what you do, users will find ways to get around your filters. They will use Unicode characters (аss, for example, uses a Cyrillic а and will not get captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, whatever you haven't managed to catch yet.
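To see the Unicode problem concretely, here is a minimal sketch. The helper name contains_word is made up for illustration, and the `\u{...}` escape assumes PHP 7 or later. A whole-word, case-insensitive regex finds the ASCII spelling but not the look-alike one:

```php
<?php
// Hypothetical helper for illustration: whole-word, case-insensitive check.
function contains_word(string $text, string $word): bool
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '~\b' . preg_quote($word, '~') . '\b~iu';
    return preg_match($pattern, $text) === 1;
}

$latin    = "what an ass";         // plain ASCII letters
$cyrillic = "what an \u{0430}ss";  // first letter is U+0430 CYRILLIC SMALL LETTER A

var_dump(contains_word($latin, 'ass'));    // bool(true)
var_dump(contains_word($cyrillic, 'ass')); // bool(false)
```

The two strings look identical on screen, but the second one never contains the byte sequence the pattern is looking for.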

If family-friendliness is essential to your application, have a person review the content before it goes live. Otherwise, add a flag feature so other people can flag offensive content. Better yet, use some sort of machine learning or Bayesian filter to automatically flag potentially offensive posts and have humans check them out manually. People read human languages better than computers.
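As a rough idea of what a Bayesian flagger could look like, here is a toy naive Bayes sketch with add-one smoothing. The class name, the labels, and the four training sentences are all invented for illustration, and it assumes PHP 7.4+ for typed properties. A real filter would need far more training data, and it should only queue posts for a human, not delete them:

```php
<?php
// Toy naive Bayes flagger; names and training data are hypothetical.
final class TinyBayesFlagger
{
    private array $wordCounts = ['ok' => [], 'bad' => []];
    private array $totals = ['ok' => 0, 'bad' => 0];
    private array $docs = ['ok' => 0, 'bad' => 0];
    private array $vocab = [];

    private function tokens(string $text): array
    {
        // Lowercase, then split on anything that is not an ASCII letter or digit.
        return preg_split('~[^a-z0-9]+~', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    public function train(string $text, string $label): void
    {
        $this->docs[$label]++;
        foreach ($this->tokens($text) as $t) {
            $this->wordCounts[$label][$t] = ($this->wordCounts[$label][$t] ?? 0) + 1;
            $this->totals[$label]++;
            $this->vocab[$t] = true;
        }
    }

    // Log-probability of the label given the tokens, with add-one smoothing.
    private function score(array $tokens, string $label): float
    {
        $score = log($this->docs[$label] / array_sum($this->docs));
        $vocabSize = count($this->vocab);
        foreach ($tokens as $t) {
            $count = $this->wordCounts[$label][$t] ?? 0;
            $score += log(($count + 1) / ($this->totals[$label] + $vocabSize));
        }
        return $score;
    }

    // True when the post looks more like the "bad" examples than the "ok" ones.
    public function shouldFlag(string $text): bool
    {
        $tokens = $this->tokens($text);
        return $this->score($tokens, 'bad') > $this->score($tokens, 'ok');
    }
}

$flagger = new TinyBayesFlagger();
$flagger->train('you are a fug schnitt', 'bad');
$flagger->train('dam this schnitt', 'bad');
$flagger->train('thanks for the helpful answer', 'ok');
$flagger->train('this works great thanks', 'ok');

var_dump($flagger->shouldFlag('what a schnitt answer')); // bool(true)
var_dump($flagger->shouldFlag('thanks this is great'));  // bool(false)
```

The sketch must be trained with at least one example per label before shouldFlag is called.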

Blender answered Oct 30 '22 14:10


The problem with whitelists/blacklists is—as other users have pointed out—your users will make it their priority to find ways around your filter for satisfaction rather than using your website for what it was intended for, whatever that may be.

One approach would be to use the undocumented profanity API that Google created for its “What Do You Love?” website. If you get a response of true, just show the user a message saying their post couldn’t be submitted due to detected profanity.

You could approach this as follows:

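<?php

if (isset($_POST['submit'])) {
    $url = sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']));
    $json = @file_get_contents($url); // returns false if the request fails
    $result = ($json !== false) ? json_decode($json) : null;

    if ($result === null || !isset($result->response)) {
        // no usable reply from the service: hold the comment for manual review
    }
    // the reply format is undocumented, so accept both boolean true and the string "true"
    elseif ($result->response === true || $result->response === 'true') {
        // profanity detected
    }
    else {
        // save comments to database as normal
    }
}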

Martin Bean answered Oct 30 '22 14:10


Other answers and comments say that programming is not the best solution to this problem. I agree with them. Those answers should be moved to Moderators - Stack Exchange or Webmasters - Stack Exchange.

Since this is Stack Overflow, my answer is going to be based on computer programming.

If you want to use str_replace, do something like this. For the sake of this post, since some people are offended by actual cusswords, let's pretend that these are bad words: 'fug', 'schnitt', 'dam'.

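$text = str_ireplace(" fug ","(Offensive words detected & removed!)",$text);

Notice that it's str_ireplace, not str_replace. The i is for "case insensitive". Without the surrounding spaces, the search would erroneously match "fuggedaboudit," for example. With them, it misses the word at the very start or end of the text, or next to punctuation.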
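To make the trade-off concrete, here is a small sketch comparing the two functions on the pretend word "fug". The sample strings are made up:

```php
<?php
$notice = '(Offensive words detected & removed!)';

// str_replace is case-sensitive, so the upper-case spelling slips through.
$a = str_replace('fug', $notice, 'FUG that');

// str_ireplace ignores case...
$b = str_ireplace('fug', $notice, 'FUG that');

// ...but it matches substrings too, so harmless words get mangled.
$c = str_ireplace('fug', $notice, 'fuggedaboudit');

echo $a, "\n"; // FUG that
echo $b, "\n"; // (Offensive words detected & removed!) that
echo $c, "\n"; // (Offensive words detected & removed!)gedaboudit
```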

If you want to do a more reliable job, you need to use regex.

$bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";

// These words are 'hits' that we need to replace.
$hit_words = array("fug","schnitt","dam");

// Turn each hit word into a regex. Anonymous functions require PHP 5.3+.
array_walk($hit_words, function(&$value, $key) {
  // \b means word boundary: the edge between a word character and a space,
  // line break, period, dash, and many others. It prevents "refudgee" from
  // being matched when searching for "fudge". The trailing i ignores case.
  $value = '~\b' . preg_quote( $value ,'~') . '\b~i';
});

/*print_r($hit_words);*/
$good_words = array("fudge","shoot","dang");

// Does all search/replace actions at once.
$good_text = preg_replace($hit_words,$good_words,$bad_text);

echo '<br />' . $good_text . '<br />';

That will do all your search/replacements at once. The two arrays should contain the same number of elements, matching up searches and replace terms. It will not match parts of words, only whole words. And of course, determined cussers will find ways of getting their swearing onto your website. But it will stop lazy cussers.
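One possible variation, sketched here as an assumption and not as part of the approach above, keeps the search and replace terms in a single associative array so they cannot drift out of step. The function name soften is hypothetical, and the sketch assumes a non-empty map with lowercase keys:

```php
<?php
// Hypothetical helper: one associative array instead of two parallel arrays.
function soften(string $text, array $map): string
{
    // Build a single alternation such as ~\b(?:fug|schnitt|dam)\b~i
    $alternatives = array_map(function ($word) {
        return preg_quote($word, '~');
    }, array_keys($map));
    $pattern = '~\b(?:' . implode('|', $alternatives) . ')\b~i';

    return preg_replace_callback($pattern, function ($match) use ($map) {
        // Look up the replacement by the lowercased match.
        return $map[strtolower($match[0])];
    }, $text);
}

$map = array('fug' => 'fudge', 'schnitt' => 'shoot', 'dam' => 'dang');
echo soften('Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn', $map);
// prints: fudge dis shoot, because a schnitter never dang wins a fuggin schnitting darn
```

As with the two-array version, only whole words are replaced, so "schnitter" and "fuggin" pass through untouched.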

I've decided to add some links to sites that obviously use programming to do a first pass at removing profanity. I'll add more as I come across them. Other than Yahoo:

1.) Dell.com - replaces matching words with <profanity deleted>. http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx

2.) Watson, the supercomputer, apparently developed a cursing problem. How do you tell the difference between cursing and slang? Apparently, it's so hard that the researchers just decided to purge it all. But they could have just used a list of curse words (exact matching is a subset of regex, I would say) and forbidden their use. That's kind of how it works in real life, anyway. (Article: "Watson develops a profanity problem".)

3.) Content Compliance section of Gmail custom settings in Apps for Business:

  1. Add expressions that describe the content you want to search for in each message

The "Expressions" used can be of several types, including "Advanced content match", which, among other things, lets you choose "Match type" options very similar to those in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is Empty, all of which presumably use regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. So the mighty Google implements regex filtering options for its business users. Why would it do that, when regex is supposedly so ineffective? Because it actually is effective enough. It is a simple, fast programming solution that will only fail when people are hell-bent on circumventing it.

Besides that list, I wonder if anyone else has noticed the similarity between weeding out profanity and filtering out spam. Clearly, regex has uses in both arenas, but nitpickers who learned by rote that "all regex is bad" will downvote any answer to any question if regex is even mentioned. Try googling "how spam filters work". You'll get results like this one, which covers SpamAssassin: http://www.seas.upenn.edu/cets/answers/spamblock-filter.html

Another example where I'm sure regex is used is when communicating via Amazon.com's Amazon Marketplace. You receive emails at your usual email address. So, naturally, when responding to a seller, your email program will include all kinds of sender information, like your email address, cc email addresses, and any you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it's worth and is therefore effective to a degree. They also keep the emails for 2 years, presumably so that a human can go over them in case of any fraud claims.
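How Amazon actually does this isn't public, so the following is only a guess at the general technique: a deliberately loose email pattern handed to preg_replace. The function name, the placeholder text, and the pattern itself are assumptions, and the pattern does not try to cover every valid address:

```php
<?php
// Hypothetical sketch: mask anything that looks like an email address.
function strip_emails(string $text): string
{
    // Loose pattern: local part, @, domain labels, then a TLD of 2+ letters.
    $pattern = '~[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}~i';
    return preg_replace($pattern, '[email removed]', $text);
}

echo strip_emails('Reach me at bob@example.com or cc amy.lee@mail.example.org.');
// prints: Reach me at [email removed] or cc [email removed].
```

Someone determined can still write "bob at example dot com", which is the same "effective to a degree" trade-off described above.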

SpamAssassin also looks at the subject and body of the message for the same sort of things that a person notices when a message "looks like spam". It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.

Regex is not mentioned, but I'm sure it's in use.
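The kind of string matching described in that excerpt could be sketched with weighted regex rules. The patterns, weights, and threshold below are invented for illustration and are not SpamAssassin's actual rules:

```php
<?php
// Hypothetical rules and weights, loosely modeled on the description above.
function spam_score(string $text): float
{
    $rules = array(
        '~\bviagra\b~i'          => 2.5,
        '~\bbuy\s+now\b~i'       => 1.5,
        '~\blowest\s+prices\b~i' => 1.5,
        '~\bclick\s+here\b~i'    => 1.0,
    );

    $score = 0.0;
    foreach ($rules as $pattern => $weight) {
        // Each rule contributes its weight once, however often it matches.
        if (preg_match($pattern, $text) === 1) {
            $score += $weight;
        }
    }
    return $score;
}

$threshold = 3.0; // arbitrary cut-off for this sketch
$message = 'Lowest prices on Viagra, click here!';

if (spam_score($message) >= $threshold) {
    echo "flag for review\n";
} else {
    echo "deliver\n";
}
```

The same scoring idea carries over to profanity: several weak signals can add up to a flag, and a human makes the final call.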

Buttle Butkus answered Oct 30 '22 13:10