Profanity Filter using a Regular Expression (list of 100 words)

What is the correct way to strip profane words from a string, given the following?

1) I have a list of 100 words to look for in an array of strings.
2) What is the correct way to handle partial words? How do most people handle this? For example, the word mass contains a profane word but is harmless itself. Sometimes a partial match is also bad: assuming foobar is an extremely profane word, I may want to disallow foobar, foobar*, and *foobar.

So do you put all the words into a single expression or loop through the list?

What's the right way to tackle it? I'm using Groovy/Grails, but examples in any modern language are welcome.

asked Nov 29 '11 23:11

BuddyJoe

1 Answer

This is quite a difficult problem to solve. You need to determine whether regular expressions will work for you and how you will handle embedding (when a dictionary word is attached to a profanity, like frackface except with the real F-word).

Regular expressions generally have a limit on how long they can be, and this usually prevents you from using a single regex for all your words. Executing multiple regular expressions against a string can be really slow, depending on the performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular expression system, but it didn't scale, so we rewrote it using a different mechanism.
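
For a short list like the 100 words in the question, the single-expression approach can still be sketched. The following is a minimal Python illustration and not how CleanSpeak works. The word lists, build_pattern, and censor are hypothetical names; foobar and hello stand in for real profanity. It assumes some words should only match as whole words while others should match anywhere (foobar, foobar*, and *foobar).

```python
import re

# Hypothetical word lists; "foobar" and "hello" are placeholders.
WHOLE_WORDS = ["hello"]   # blocked only as a standalone word
ANYWHERE = ["foobar"]     # blocked as foobar, foobar*, and *foobar

def build_pattern(whole_words, anywhere):
    # re.escape keeps any regex metacharacters in a word literal.
    whole = [r"\b" + re.escape(w) + r"\b" for w in whole_words]
    # \w* on both sides extends the match to the whole containing word.
    embedded = [r"\w*" + re.escape(w) + r"\w*" for w in anywhere]
    return re.compile("|".join(whole + embedded), re.IGNORECASE)

PATTERN = build_pattern(WHOLE_WORDS, ANYWHERE)

def censor(text):
    # Replace each match with asterisks of the same length.
    return PATTERN.sub(lambda m: "*" * len(m.group()), text)
```

With these lists, censor leaves othello and mass alone because hello is only matched between word boundaries, while any word containing foobar is masked in full. Compiling the pattern once and reusing it avoids looping over the list for every string.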

You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):

h e l l o
h.e.l.l.o
h_e_l_l_o
|-|ello
h3llo
"hello there" (this phrase might not contain any profane words but combined they are profane)

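One common way to cope with the spaced-out, punctuated, and leet-speak variants above is to normalize the text before matching it. The sketch below is only an illustration under assumptions: the LEET table is a tiny hypothetical sample, normalize is a made-up helper, and the output is meant for detection only, not for display.

```python
import re

# Hypothetical substitution table; a real one would be far larger.
LEET = {"|-|": "h", "3": "e", "0": "o", "@": "a"}

# A run of single letters joined by spaces, dots, underscores, or hyphens,
# not preceded or followed by another letter (e.g. "h e l l o").
SPELLED_OUT = re.compile(r"(?<![a-z])[a-z](?:[\s._\-]+[a-z])+(?![a-z])")

def normalize(text):
    text = text.lower()
    # Replace longer sequences first so "|-|" is handled as one unit.
    for seq in sorted(LEET, key=len, reverse=True):
        text = text.replace(seq, LEET[seq])
    # Collapse spelled-out runs by keeping only their letters.
    return SPELLED_OUT.sub(lambda m: re.sub(r"[^a-z]", "", m.group()), text)
```

All five single-word variants in the list normalize to hello, while ordinary text such as "hello there" keeps its spaces. The sketch has blind spots: it rewrites every digit in the table, so "3 apples" becomes "e apples", and it also joins legitimate runs of one-letter words. Phrases would have to be added to the word list as multi-word entries.
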
You also need to handle edge cases where two or more dictionary (whitelist) words form a profanity when placed next to each other. Some examples that contain the s-word:

bash it
ssh it's quiet time

These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
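
To make this failure mode concrete, here is a small Python sketch that again treats hello as the profanity, so "shell oil" plays the role of "bash it". It is an assumption-laden illustration, not a description of any commercial filter: PROFANE, DICTIONARY, and both functions are hypothetical. The naive check strips spaces and flags the phrase; the second check ignores a hit when every word it touches is on the whitelist.

```python
import re

PROFANE = "hello"              # placeholder profanity
DICTIONARY = {"shell", "oil"}  # hypothetical whitelist of harmless words

def naive_hit(text):
    # Removing spaces catches "h e l l o" but also flags "shell oil".
    return PROFANE in text.lower().replace(" ", "")

def whitelist_aware_hit(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    joined = "".join(tokens)
    # Record where each token sits inside the joined string.
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok), tok))
        pos += len(tok)
    start = joined.find(PROFANE)
    while start != -1:
        end = start + len(PROFANE)
        touched = [t for (s, e, t) in spans if s < end and e > start]
        # Report the hit unless every overlapped token is whitelisted.
        if not all(t in DICTIONARY for t in touched):
            return True
        start = joined.find(PROFANE, start + 1)
    return False
```

Here naive_hit("shell oil") is true, while whitelist_aware_hit("shell oil") is false and "h e l l o" is still caught. The quality of the result depends entirely on how complete the whitelist is.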

We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak and improve it. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. This is not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and to create a system that doesn't use regular expressions.
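
The answer does not say which mechanism replaced the regular expressions. As one simple example of a regex-free approach, and purely as an assumption, exact-word filtering can be done with a set lookup per token. BLOCKED and censor_tokens are hypothetical names, and the words are placeholders.

```python
import string

# Hypothetical blocked words; set membership tests take constant time
# on average, regardless of how long the list grows.
BLOCKED = frozenset({"foobar", "hello"})

def censor_tokens(text):
    out = []
    for token in text.split(" "):
        # Compare the word without surrounding punctuation or case.
        core = token.strip(string.punctuation).lower()
        if core in BLOCKED:
            # Mask letters and digits, keep punctuation in place.
            token = "".join("*" if ch.isalnum() else ch for ch in token)
        out.append(token)
    return " ".join(out)
```

This only catches exact words, so othello passes and embedded forms such as foobar* are missed. Finding many words as substrings in one pass is usually done with a multi-pattern matcher such as the Aho-Corasick algorithm.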

answered Sep 19 '22 16:09

voidmain

Tags:

language-agnostic

regex

profanity