
Tokenizing sentences with unnest_tokens(), ignoring abbreviations

Tags: text, r, tidytext

I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into the two sentences

  1. "I am perfectly convinced by it that Mr. Darcy has no defect."
  2. "He owns it himself without disguise."

However, when I use the default sentence tokenizer of tidytext I get three sentences.

Code

library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
                              Sentence
                                <chr>
1 i am perfectly convinced by it that mr.
2                    darcy has no defect.
3    he owns it himself without disguise.

What is a simple way to use tidytext to tokenize sentences but without running into issues with common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?

Asked by bschneidr


2 Answers

You can use a regex as the splitting condition, but there is no guarantee that it will cover all common honorifics. The negative lookbehind below suppresses a split whenever the period is immediately preceded by a single word-initial letter plus "r", which covers "Mr", "Dr", "Sr", and "Jr":

# split on periods, except after a one-letter-plus-"r" abbreviation
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")

Result:

# A tibble: 2 x 1
                                                     Sentence
                                                        <chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2                         he owns it himself without disguise

You can of course always create your own list of common titles and build a regex from it:

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)
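
Note that unnest_tokens() lowercases its output by default, which is why the sentences above come back in lowercase; its to_lower argument switches that off if you want to keep the original capitalization:

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex, to_lower = FALSE)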
Answered by acylam


Both corpus and quanteda have special handling for abbreviations when determining sentence boundaries. Here's how to split sentences with corpus:

library(dplyr)
library(corpus)
df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

text_split(df$Example_Text, "sentences")
##   parent index text                                                         
## 1 1          1 I am perfectly convinced by it that Mr. Darcy has no defect. 
## 2 1          2 He owns it himself without disguise.
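
The quanteda route mentioned above looks similar. Here is a minimal sketch; corpus_reshape() with to = "sentences" is quanteda's sentence segmenter, and the object name corp is just illustrative:

library(quanteda)

corp <- corpus("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")
# reshape the corpus so that each document is a single sentence
corpus_reshape(corp, to = "sentences")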

If you want to stick with unnest_tokens but want a more exhaustive list of English abbreviations, you can follow @acylam's advice above and use the corpus abbreviation list (most of which were taken from the Common Locale Data Repository):

abbreviations_en
##  [1] "A."       "A.D."     "a.m."     "A.M."     "A.S."     "AA."       
##  [7] "AB."      "Abs."     "AD."      "Adj."     "Adv."     "Alt."    
## [13] "Approx."  "Apr."     "Aug."     "B."       "B.V."     "C."      
## [19] "C.F."     "C.O.D."   "Capt."    "Card."    "cf."      "Col."    
## [25] "Comm."    "Conn."    "Cont."    "D."       "D.A."     "D.C."    
## (etc., 155 total)
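
As a sketch of how that list could feed the regex approach from the first answer (this combination is my own suggestion, not something from the corpus documentation): strip the trailing period from each abbreviation, escape any internal periods, and join everything into one lookbehind. Multi-period abbreviations such as "a.m." will still split at their internal periods with this pattern, and very long lookbehinds can be slow, so treat it as a starting point:

library(corpus)
library(tidytext)

# drop the trailing period from each abbreviation ("Mr." -> "Mr")
abbr <- sub("\\.$", "", abbreviations_en)
# escape the remaining internal periods ("a.m" -> "a\\.m")
abbr <- gsub(".", "\\.", abbr, fixed = TRUE)
pattern <- paste0("(?<!\\b(", paste(abbr, collapse = "|"), "))\\.")

unnest_tokens(df, input = "Example_Text", output = "Sentence",
              token = "regex", pattern = pattern)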
Answered by Patrick Perry


