Profanity Filter using a Regular Expression (list of 100 words)

What is the correct way to strip profane words from a string given:
1) I have a list of 100 words to look for in an array of strings.
2) What is the correct way to handle partial words? How do most people handle this? For example, the word mass is harmless even though it contains a profane word. Then sometimes a partial word is also bad: assume foobar is an extremely profane word; I may want to disallow foobar, foobar*, and *foobar.

So do you put all the words into a single expression or loop through the list?

What's the right way to tackle it? I'm using Groovy/Grails but any modern languages examples welcome.

BuddyJoe asked Nov 29 '11 23:11



1 Answer

This is quite a difficult problem to solve. You need to determine whether regular expressions will work for you and how you will handle embedding (when a dictionary word is attached to a profanity, like frackface except with the real F-word).

Regular expression engines often limit how long a pattern can be, which can prevent you from using a single regex for all your words. Executing multiple regular expressions against each string can be too slow, depending on the performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular expression system, but it didn't scale, and we rewrote it using a different mechanism.
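
For a list on the order of 100 words, one option worth measuring is to compile every word into a single alternation rather than looping over one pattern per word. The following is a minimal Python sketch of that idea, not a complete filter. The names BLOCKLIST, build_pattern and censor are made up for illustration, and "foobar" and "hello" stand in for real profanity. Whether a combined pattern stays within your engine's limits and your performance budget is something to test with your own list.

```python
import re

# Hypothetical blocklist; "foobar" and "hello" stand in for real profanity.
BLOCKLIST = ["foobar", "hello"]


def build_pattern(words, whole_words_only=True):
    """Compile all words into one case-insensitive alternation."""
    # Escape each word, and put longer words first so they are tried
    # before shorter entries that happen to be their prefixes.
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    body = "(?:" + "|".join(escaped) + ")"
    if whole_words_only:
        # Matches foobar but not foobarish or xfoobar.
        body = r"\b" + body + r"\b"
    else:
        # Matches foobar, foobar* and *foobar, swallowing the whole word.
        body = r"\w*" + body + r"\w*"
    return re.compile(body, re.IGNORECASE)


def censor(text, pattern):
    """Replace every match with asterisks of the same length."""
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)


strict = build_pattern(BLOCKLIST)
loose = build_pattern(BLOCKLIST, whole_words_only=False)
print(censor("Hello world, foobarish", strict))  # ***** world, foobarish
print(censor("Hello world, foobarish", loose))   # ***** world, *********
```

The whole_words_only switch is the partial-word decision from the question: the strict form leaves a harmless word that merely contains a listed word alone, while the loose form also blanks foobar* and *foobar at the cost of more false positives.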

You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):

  • h e l l o
  • h.e.l.l.o
  • h_e_l_l_o
  • |-|ello
  • h3llo
  • "hello there" (this phrase might not contain any profane words but combined they are profane)
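
One common way to cope with the spaced-out and leet-speak variants above is to normalize the text before matching. Here is a rough Python sketch, assuming a tiny hand-written substitution table; SUBSTITUTIONS, SPACED_OUT and normalize are illustrative names, and a real table would need many more entries. It does not address phrases such as "hello there" or other languages.

```python
import re

# Hypothetical substitution table covering only the examples in the list.
SUBSTITUTIONS = {"|-|": "h", "3": "e"}

# A run of three or more single letters separated by one space, dot or
# underscore, e.g. "h e l l o", "h.e.l.l.o", "h_e_l_l_o".
SPACED_OUT = re.compile(r"(?<![a-z])[a-z](?:[ ._][a-z]){2,}(?![a-z])")


def normalize(text):
    """Lowercase, undo known disguises, and join spaced-out letters."""
    text = text.lower()
    for disguise, letter in SUBSTITUTIONS.items():
        text = text.replace(disguise, letter)
    # Drop the separators inside each spaced-out run.
    return SPACED_OUT.sub(lambda m: re.sub(r"[ ._]", "", m.group(0)), text)


for sample in ["h e l l o", "h.e.l.l.o", "h_e_l_l_o", "|-|ello", "h3llo"]:
    print(normalize(sample))  # prints hello each time
```

Normalization has a cost: mapping 3 to e also rewrites ordinary numbers, and joining single letters can merge innocent text, so this step tends to raise the false positive rate.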

You also need to handle edge cases where two or more dictionary (whitelist) words contain a profanity when next to each other. Some examples that contain the s-word:

  • bash it
  • ssh it's quiet time

These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
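
To make the edge case concrete, here is a small Python sketch using a clean stand-in: pretend "nowhere" is the banned word, so "now here" plays the role of "bash it". The filter ignores spaces so that it can catch "n o w h e r e", then discards any hit whose letters all come from ordinary dictionary words. BANNED, DICTIONARY and find_banned are hypothetical names, and the five-word dictionary is only for illustration.

```python
import re

BANNED = ["nowhere"]                        # stand-in for a real profanity
DICTIONARY = {"we", "are", "now", "here"}   # hypothetical whitelist


def find_banned(text):
    """Find banned words while ignoring separators, skipping hits that
    are fully explained by whitelisted dictionary words."""
    lowered = text.lower()
    # Each token with its (start, end) position in the lowercased text.
    tokens = [(m.start(), m.end(), m.group(0))
              for m in re.finditer(r"[a-z]+", lowered)]
    # Letters only, remembering where each letter came from.
    positions = [i for i, ch in enumerate(lowered) if "a" <= ch <= "z"]
    squashed = "".join(lowered[i] for i in positions)
    hits = []
    for word in BANNED:
        start = squashed.find(word)
        while start != -1:
            first = positions[start]
            last = positions[start + len(word) - 1]
            # Tokens that overlap the matched span.
            touched = [tok for s, e, tok in tokens if s <= last and e > first]
            if not all(tok in DICTIONARY for tok in touched):
                hits.append(lowered[first:last + 1])
            start = squashed.find(word, start + 1)
    return hits


print(find_banned("we are now here"))   # []
print(find_banned("n o w h e r e"))     # ['n o w h e r e']
print(find_banned("go nowhere"))        # ['nowhere']
```

The weakness is the mirror image of the false positive: anyone who splits a banned word into pieces that are themselves dictionary words gets through, which is part of why this problem is harder than it first looks.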

We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak it and make it better. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. That is not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and be prepared to build a system that doesn't use regular expressions.
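
The answer does not describe the mechanism CleanSpeak uses. Purely as an illustration of what a non-regex matcher can look like, here is a Python sketch that stores the list in a trie (a prefix tree) and scans the text once from every starting position. This is an assumed alternative, not CleanSpeak's design; build_trie, scan and the "$end" marker are made-up names, and "hello" and "foobar" are stand-ins.

```python
def build_trie(words):
    """Build a nested-dict trie; the "$end" key marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def scan(text, trie):
    """Return (start, end) spans of every listed word found in the text.

    Spans index the lowercased text and include matches inside longer
    words, so deciding what to do about embedding is left to the caller.
    """
    lowered = text.lower()
    spans = []
    for start in range(len(lowered)):
        node = trie
        for end in range(start, len(lowered)):
            node = node.get(lowered[end])
            if node is None:
                break  # no listed word continues with this character
            if "$end" in node:
                spans.append((start, end + 1))
    return spans


trie = build_trie(["hello", "foobar"])
print(scan("Say hello, foobar!", trie))  # [(4, 9), (11, 17)]
print(scan("Othello", trie))             # [(2, 7)]
```

Adding words grows the trie but not the number of passes over the text, which is the property a loop of separate regular expressions lacks.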

voidmain answered Sep 19 '22 16:09