
Tokenizing sentences with unnest_tokens(), ignoring abbreviations

Tags: text, r, tidytext

I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into the two sentences

  1. "I am perfectly convinced by it that Mr. Darcy has no defect."
  2. "He owns it himself without disguise."

However, when I use the default sentence tokenizer of tidytext I get three sentences.

Code

library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
                              Sentence
                                <chr>
1 i am perfectly convinced by it that mr.
2                    darcy has no defect.
3    he owns it himself without disguise.

What is a simple way to use tidytext to tokenize sentences but without running into issues with common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?

Asked by bschneidr


2 Answers

You can use a regex as the splitting condition, but there is no guarantee that it will cover all common honorifics. The negative lookbehind below suppresses a split whenever the period is immediately preceded by a single word-initial letter plus "r", which covers "Mr", "Dr", "Sr", and "Jr":

# split on periods, except after a one-letter-plus-"r" abbreviation
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")

Result:

# A tibble: 2 x 1
                                                     Sentence
                                                        <chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2                         he owns it himself without disguise

You can of course always create your own list of common titles and build a regex from it:

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)
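
Note that unnest_tokens() lowercases its output by default, which is why the sentences above come back in lowercase; its to_lower argument switches that off if you want to keep the original capitalization:

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex, to_lower = FALSE)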
Answered by acylam


Both corpus and quanteda have special handling for abbreviations when determining sentence boundaries. Here's how to split sentences with corpus:

library(dplyr)
library(corpus)
df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

text_split(df$Example_Text, "sentences")
##   parent index text                                                         
## 1 1          1 I am perfectly convinced by it that Mr. Darcy has no defect. 
## 2 1          2 He owns it himself without disguise.
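
The quanteda route mentioned above looks similar. Here is a minimal sketch; corpus_reshape() with to = "sentences" is quanteda's sentence segmenter, and the object name corp is just illustrative:

library(quanteda)

corp <- corpus("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")
# reshape the corpus so that each document is a single sentence
corpus_reshape(corp, to = "sentences")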

If you want to stick with unnest_tokens but want a more exhaustive list of English abbreviations, you can follow @acylam's advice above and use the corpus abbreviation list (most of which were taken from the Common Locale Data Repository):

abbreviations_en
##  [1] "A."       "A.D."     "a.m."     "A.M."     "A.S."     "AA."       
##  [7] "AB."      "Abs."     "AD."      "Adj."     "Adv."     "Alt."    
## [13] "Approx."  "Apr."     "Aug."     "B."       "B.V."     "C."      
## [19] "C.F."     "C.O.D."   "Capt."    "Card."    "cf."      "Col."    
## [25] "Comm."    "Conn."    "Cont."    "D."       "D.A."     "D.C."    
## (etc., 155 total)
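
As a sketch of how that list could feed the regex approach from the first answer (this combination is my own suggestion, not something from the corpus documentation): strip the trailing period from each abbreviation, escape any internal periods, and join everything into one lookbehind. Multi-period abbreviations such as "a.m." will still split at their internal periods with this pattern, and very long lookbehinds can be slow, so treat it as a starting point:

library(corpus)
library(tidytext)

# drop the trailing period from each abbreviation ("Mr." -> "Mr")
abbr <- sub("\\.$", "", abbreviations_en)
# escape the remaining internal periods ("a.m" -> "a\\.m")
abbr <- gsub(".", "\\.", abbr, fixed = TRUE)
pattern <- paste0("(?<!\\b(", paste(abbr, collapse = "|"), "))\\.")

unnest_tokens(df, input = "Example_Text", output = "Sentence",
              token = "regex", pattern = pattern)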
Answered by Patrick Perry


