Natural Language Processing: Find obscenities in English?

Tags:

java

nlp

Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?

If it makes it easier, I don't need to filter anything out, just to get a count. So a few false positives aren't the end of the world, as long as the rate is overestimated more or less uniformly.

Nick Heiner asked Dec 02 '09 20:12



2 Answers

Use a huge list, and think about the target audience. Is there a third-party service that specialises in this that you could use rather than rolling your own?

Some quick thoughts:

  • The Scunthorpe problem (and follow the links to "Swear filter" for more)
  • British or American English? Words like fanny and fag mean different things in each
  • Political correctness: "black" or "Afro-American"?

Edit:

  • Be very careful. Normal words can offend, whether by choice or through ignorance
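
To make the list approach and the Scunthorpe problem concrete, here is a minimal Java sketch that only counts matches. The class name, the method names, and the two mild placeholder entries in BLOCKLIST are assumptions made for illustration; a real list would be much larger and tuned to the audience and dialect. It compares whole-token matching with naive substring matching, which also flags innocent words that happen to contain a listed word.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ObscenityCounter {
    // Placeholder entries, assumed for illustration only; not a real word list.
    static final Set<String> BLOCKLIST = Set.of("darn", "heck");

    // Counts tokens that equal a list entry exactly, ignoring case.
    static long countExact(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(BLOCKLIST::contains)
                .count();
    }

    // Counts tokens that merely contain a list entry. Shown only to
    // illustrate the Scunthorpe problem: innocent words get flagged too.
    static long countSubstring(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> BLOCKLIST.stream().anyMatch(t::contains))
                .count();
    }

    public static void main(String[] args) {
        List<String> tokens =
                List.of("Heck", "they", "heckle", "the", "darn", "speaker");
        System.out.println("exact matches:     " + countExact(tokens));
        System.out.println("substring matches: " + countSubstring(tokens));
    }
}
```

With this sample input, exact matching counts "Heck" and "darn", while substring matching also counts "heckle". Since the question's words are already tokenised and tagged, exact matching on tokens (possibly after stemming) is probably the safer starting point, though it will miss deliberate misspellings.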
gbn answered Sep 28 '22 05:09



Is the phrase "I want to stick my long-necked Giraffe up your fluffy white bunny" obscene?

Pete Kirkham answered Sep 28 '22 05:09
