Extracting noun+noun or (adj|noun)+noun from Text

Tags: r, nlp, pos-tagger, opennlp

Is it possible to extract noun+noun or (adj|noun)+noun sequences with the R package openNLP? That is, I would like to use linguistic filtering to extract candidate noun phrases. Could you show me how to do this? Many thanks.

Thanks for the responses. Here is the code:

library("openNLP")

acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in
        pipeline and terminal operations for 12.2 mln dlrs. The company said 
        the sale is subject to certain post closing adjustments, 
        which it did not explain. Reuter." 

# tagPOS() belongs to the older openNLP API; it returns "token/TAG" items
# separated by spaces
acqTag <- tagPOS(acq)
acqTagSplit <- strsplit(acqTag, " ")
acqTagSplit

# Keep the POS tag (the part after "/") of every token/TAG item
tag <- sapply(strsplit(acqTagSplit[[1]], "/"), function(x) x[2])

# Positions i where (tag[i], tag[i+1]) is noun+noun or adj+noun:
# the first tag is NN, NNS or JJ and the second tag is NN or NNS
n <- length(tag)
first  <- tag[-n] %in% c("NN", "NNS", "JJ")
second <- tag[-1] %in% c("NN", "NNS")
index  <- which(first & second)

index

Readers can use index to look up positions in acqTagSplit[[1]] and extract the noun+noun or (adj|noun)+noun pairs. (The code is not optimal, but it works. If you have any ideas for improving it, please let me know.)

Furthermore, I still have a problem.

Justeson and Katz (1995) proposed another linguistic filter for extracting candidate noun phrases:

((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun

I do not fully understand what it means. Could you explain it, or translate this representation into R? Many thanks.

asked Jan 05 '11 03:01

ssuhan

1 Answer

I don't have an open console on which to test this, but have you tried tokenizing with tagPOS and then grepping for "noun", "noun", or perhaps using paste(tagPOS(acq), collapse=".") and searching for "noun.noun"? gregexpr could then be used to extract the positions.
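
A minimal sketch of that idea, assuming the tagger output is a single string of token/TAG items separated by spaces. The tagged string below is hand-written for illustration (it is not real tagPOS output), and the names code, codestr and starts are made up for this example. Each tag is reduced to one letter so that character positions in the collapsed string line up with token positions, and a lookahead lets gregexpr report overlapping pairs.

```r
# Hand-written tagged text (an assumption, not real tagger output)
tagged <- "the/DT certain/JJ post/NN closing/NN adjustments/NNS were/VBD large/JJ"

items <- strsplit(tagged, " ")[[1]]
word  <- sub("/[^/]*$", "", items)   # text before the last "/"
tag   <- sub("^.*/", "", items)      # text after the last "/"

# One letter per token: A = adjective, N = common noun, x = anything else
code <- ifelse(tag == "JJ", "A", ifelse(tag %in% c("NN", "NNS"), "N", "x"))
codestr <- paste(code, collapse = "")

# "[AN](?=N)": an adjective or noun immediately followed by a noun.
# The lookahead does not consume the second token, so overlapping
# pairs such as N N N are all reported.
starts <- as.integer(gregexpr("[AN](?=N)", codestr, perl = TRUE)[[1]])
starts <- starts[starts > 0]         # gregexpr returns -1 when nothing matches

pairs <- paste(word[starts], word[starts + 1])
print(pairs)
```

For this sample, starts is 2, 3, 4 and the pairs are "certain post", "post closing" and "closing adjustments". Proper nouns (NNP) are left out here to mirror the tag list in the question; widening the mapping is a one-line change.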

EDIT: The format of the tagged output was a bit different than I remembered. I think this method of read.table()-ing after substituting "\n"s for spaces is much more efficient than what I see above:

 acqdf <- read.table(textConnection(gsub(" ", "\n", acqTag)), sep="/", stringsAsFactors=FALSE)
 acqdf$nnadj <- grepl("NN|JJ", acqdf$V2)
 acqdf$nnadj 
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[16] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
#[31]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)]
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#[16] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#[31] FALSE FALSE FALSE FALSE FALSE FALSE
 acqdf$pair <- c(NA, acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)])
 acqdf[1:7, ]

            V1  V2 nnadj  pair
1         Gulf NNP  TRUE    NA
2      Applied NNP  TRUE  TRUE
3 Technologies NNP  TRUE  TRUE
4          Inc NNP  TRUE  TRUE
5         said VBD FALSE FALSE
6           it PRP FALSE FALSE
7         sold VBD FALSE FALSE
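
As for the Justeson and Katz filter in the question, one hedged reading is this: with A = adjective, N = noun and P = preposition, a candidate is either a run of adjectives and nouns ending in a noun, or two such runs joined by a single noun+preposition link and again ending in a noun (as in "degrees of freedom"). The sketch below assumes the starred form ((A|N)+|((A|N)*(NP)?)(A|N)*)N and a data frame shaped like acqdf (V1 = word, V2 = Penn Treebank tag). The rows are hand-written for illustration, and mapping the tag IN to P is my assumption.

```r
# Hand-written rows shaped like acqdf (V1 = word, V2 = tag); an assumption,
# not real tagger output
df <- data.frame(
  V1 = c("the", "sale", "of", "pipeline", "operations", "is", "subject",
         "to", "certain", "post", "closing", "adjustments"),
  V2 = c("DT", "NN", "IN", "NN", "NNS", "VBZ", "JJ",
         "TO", "JJ", "NN", "NN", "NNS"),
  stringsAsFactors = FALSE
)

# One letter per token so string positions equal row numbers:
# A = adjective (JJ*), N = noun (NN*), P = preposition (IN), x = other
code <- ifelse(startsWith(df$V2, "JJ"), "A",
        ifelse(startsWith(df$V2, "NN"), "N",
        ifelse(df$V2 == "IN", "P", "x")))
codestr <- paste(code, collapse = "")

# ((A|N)+ | (A|N)*(NP)?(A|N)*) N, with the linked alternative tried first
# so that a phrase containing a preposition link is not cut short
jk <- "(?:[AN]*(?:NP)?[AN]*|[AN]+)N"
m <- gregexpr(jk, codestr, perl = TRUE)[[1]]
start <- as.integer(m)
len <- attr(m, "match.length")

# The pattern as written also matches a lone noun; keep two or more tokens
keep <- start > 0 & len >= 2
phrases <- mapply(function(s, l) paste(df$V1[s:(s + l - 1)], collapse = " "),
                  start[keep], len[keep])
print(phrases)
```

With these rows the candidates are "sale of pipeline operations" and "certain post closing adjustments". Matches are non-overlapping and found left to right, so sub-phrases of a longer candidate are not listed separately.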

answered Sep 27 '22 17:09

IRTFM
