R break corpus into sentences

  1. I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences?

  2. It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a dataframe, and this route would also require abandoning the corpus and reading all files individually (a rough sketch of it follows the note below).

  3. How can I pass the function sentSplit {qdap} over a corpus in tm? Or is there a better way?

Note: there was a function sentDetect in the openNLP library, which has since been replaced by Maxent_Sent_Token_Annotator; the same question applies: how can this be combined with a tm corpus?
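
For reference, a rough sketch of that qdap route, assuming sentSplit accepts a data frame plus the name of its text column (the file name and column names below are illustrative only):

# Illustrative sketch of the readLines + qdap::sentSplit approach
library(qdap)
raw_lines <- readLines("some_document.txt")   # hypothetical file name
df <- data.frame(doc = "doc1",
                 text = paste(raw_lines, collapse = " "),
                 stringsAsFactors = FALSE)
sentSplit(df, "text")   # one row per detected sentence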

asked Sep 10 '13 by Henk



1 Answer

I don't know how to reshape a corpus but that would be a fantastic functionality to have.

I guess my approach would be something like this:

Using these packages

# Load Packages
require(tm)
require(NLP)
require(openNLP)

I would set up my text to sentences function as follows:

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}
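
As a quick sanity check, and assuming the English sentence model shipped with openNLPdata is available, the function can be tried on a single character string (the example text is mine, not from the question):

# Quick check on a single string; should return a character vector
# with one element per detected sentence
s <- "Doctor Who is a British science fiction programme. It is produced by the BBC."
convert_text_to_sentences(s)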

And my hack of a reshape corpus function (NB: you will lose the meta attributes here unless you modify this function somehow and copy them over appropriately; a small sketch of one way to at least keep track of each sentence's source document comes after the example below):

reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put into a list
  # (Content() is the accessor in the tm 0.5.x version used here; newer tm
  # versions, 0.6 and later, use content() from NLP instead)
  text <- lapply(current.corpus, Content)

  # Basically convert the text
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))

  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}

Which works as follows:

## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
                  doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
                  doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
                  stringsAsFactors = FALSE)

current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents

## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents
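
If you also need to know which original document each sentence came from, one minimal variant (my own sketch, not part of the answer above) is to return that mapping alongside the new corpus rather than trying to copy the tm meta data over:

# Sketch: same reshaping, but also return which source document each
# sentence originated from (a plain integer vector, not tm meta data)
reshape_corpus_tracked <- function(current.corpus, FUN, ...) {
  text <- lapply(current.corpus, Content)
  docs <- lapply(text, FUN, ...)
  origin <- rep(seq_along(docs), times = sapply(docs, length))
  new.corpus <- Corpus(VectorSource(unlist(docs)))
  list(corpus = new.corpus, origin = origin)
}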

My sessionInfo output

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
  [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] NLP_0.1-0     openNLP_0.2-1 tm_0.5-9.1   

loaded via a namespace (and not attached):
  [1] openNLPdata_1.5.3-1 parallel_3.0.1      rJava_0.9-4         slam_0.1-29         tools_3.0.1  
answered by Tony Breyal