I'm trying to find a way to negate sentences based on POS-tagging. Please consider:
include_once 'class.postagger.php';

function negate($sentence) {
    $tagger = new PosTagger('includes/lexicon.txt');
    $tags = $tagger->tag($sentence);

    // Build a "token/TAG token/TAG ..." string
    $input = array();
    foreach ($tags as $t) {
        $input[] = trim($t['token']) . "/" . trim($t['tag']);
    }
    $sentence = implode(" ", $input);
    $postagged = $sentence;

    // Prepend "not" to every adjective, modal, adverb or verb (JJ, MD, RB, VB, VBD, VBN)
    // Todo: ignore negative words (not, never, neither)
    $sentence = preg_replace("/(\w+)\/(JJ|MD|RB|VB|VBD|VBN)\b/", "not$1/$2", $sentence);

    // Remove all POS tags
    $sentence = preg_replace("/\/[A-Z$]+/", "", $sentence);

    return "$postagged<br>$sentence";
}
BTW: In this example, I'm using the POS-tagging implementation and lexicon of Ian Barber. An example of this code running would be:
echo negate("I will never go to their place again");
I/NN will/MD never/RB go/VB to/TO their/PRP$ place/NN again/RB
I notwill notnever notgo to their place notagain
As you can see (and this issue is also noted in the code), the negating words themselves are being negated as well: never becomes notnever, which obviously shouldn't happen. Since my regex skills aren't all that, is there a way to exclude these words from the regex used?
[edit] Also, I would very much welcome any other comments / critiques you might have on this negation implementation, since I'm sure it's (still) quite flawed :-)
Give this a try:
$sentence = preg_replace("/(\s)(?:(?!never|neither|not)(\w*))\/(JJ|MD|RB|VB|VBD|VBN)\b/", "$1not$2", $sentence);
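A couple of caveats with the pattern above, as far as I can tell: because it requires a leading whitespace character, a taggable word at the very start of the string is never negated, and because the lookahead is not anchored to the end of the word, anything that merely starts with "not" (e.g. "notable") is skipped too. Here is a minimal sketch of a variant that tries to address both. The function name negate_tagged and the contents of the skip list are my own assumptions for illustration, not part of the original code or the tagger library.

```php
<?php
// Sketch: negate a POS-tagged string ("word/TAG word/TAG ...") while
// leaving words that are already negative untouched.
// negate_tagged() and the $skip list are hypothetical, for illustration only.
function negate_tagged(string $tagged): string
{
    // Assumed list of negative words to leave alone; extend as needed.
    $skip = 'not|never|neither|nor|no';

    // (?<!\S)       : token starts at the string start or right after whitespace
    // (?!(?i:..)\/) : skip only whole words from the list (case-insensitive),
    //                 so "notable/JJ" is still negated but "never/RB" is not
    $pattern = '/(?<!\S)(?!(?i:' . $skip . ')\/)(\w+)\/(JJ|MD|RB|VB|VBD|VBN)\b/';
    $negated = preg_replace($pattern, 'not$1/$2', $tagged);

    // Strip the POS tags, same as in the question's function.
    return preg_replace('/\/[A-Z$]+/', '', $negated);
}

echo negate_tagged('I/NN will/MD never/RB go/VB to/TO their/PRP$ place/NN again/RB');
// prints: I notwill never notgo to their place notagain
```

Keeping the tag in the replacement ('not$1/$2') means the existing tag-stripping step can stay as it is. Whether words like "no" or "nor" belong in the skip list depends on how your lexicon tags them, so treat that list as a starting point.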