Here is the R Code:
library(NLP)
library(openNLP)
tagPOS <- function(x, ...) {
s <- as.String(x)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
list(POStagged = POStagged, POStags = POStags)}
str <- "this is a the first sentence."
tagged_str <- tagPOS(str)
Output is :
tagged_str $POStagged [1]"this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."
Now I want to extract only NN word i.e sentence from the above sentence and want to store it into a variable .Can anyone help me out with this .
To pos-tag a text, we start by loading an example text into R. Now that the text data has been read into R, we can proceed with the part-of-speech tagging. To perform the pos-tagging, we load the function for pos-tagging, load the NLP and openNLP packages.
In simple words, we can say that POS tagging is a task of labelling each word in a sentence with its appropriate part of speech. We already know that parts of speech include nouns, verb, adverbs, adjectives, pronouns, conjunction and their sub-categories.
A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case etc. POS tags are used in corpus searches and in text analysis tools and algorithms.
Stochastic Part-of-Speech Tagging The simplest stochastic taggers disambiguate words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set with the word is the one assigned to an ambiguous instance of that word.
Here is a more general solution, where you can describe the Treebank tag you desire to extract using a regular expression. In your case for instance, "NN" returns all noun types (e.g. NN, NNS, NNP, NNPS) while "NN$" returns just NN.
It operates on a character type, so if you have your texts as a list, you will need to lapply()
it as in the examples below.
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
extractPOS <- function(x, thisPOSregex) {
x <- as.String(x)
wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
POSwords <- subset(POSAnnotation, type == "word")
tags <- sapply(POSwords$features, '[[', "POS")
thisPOSindex <- grep(thisPOSregex, tags)
tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
untokenizedAndTagged
}
lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
##
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
##
## [[2]]
## [1] ""
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With