
R: regexpr() how to use a vector in pattern parameter

I would like to find the positions of dictionary terms within a set of short texts. The problem is in the last lines of the following code, which is roughly based on the question "From of list of strings, identify which are human names and which are not".

library(tm)

pkd.names.quotes <- c(
  "Mr. Rick Deckard",
  "Do Androids Dream of Electric Sheep",
  "Roy Batty",
  "How much is an electric ostrich?",
  "My schedule for today lists a six-hour self-accusatory depression.",
  "Upon him the contempt of three planets descended.",
  "J.F. Sebastian",
  "Harry Bryant",
  "goat class",
  "Holden, Dave",
  "Leon Kowalski",
  "Dr. Eldon Tyrell"
) 


firstnames <- c("Sebastian", "Dave", "Roy",
                "Harry", "Dave", "Leon",
                "Tyrell")

dict  <- sort(unique(tolower(firstnames)))

corp <- VCorpus(VectorSource(pkd.names.quotes))
#strange but Corpus() gives wrong segment numbers for the matches.

tdm  <-
  TermDocumentMatrix(corp, control = list(tolower = TRUE, dictionary = dict))

inspect(corp)
inspect(tdm)

View(as.matrix(tdm))

data.frame(
  Name      = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = regexpr(
    pattern = rownames(tdm)[tdm$i],
    text = tolower(pkd.names.quotes[tdm$j])
  )
)

The output comes with a warning, and only the first row is correct:

       Name Segment          Content Postion
1       roy       3        Roy Batty       1
2 sebastian       7   J.F. Sebastian      -1
3     harry       8     Harry Bryant      -1
4      dave      10     Holden, Dave      -1
5      leon      11    Leon Kowalski      -1
6    tyrell      12 Dr. Eldon Tyrell      -1

Warning message:
In regexpr(pattern = rownames(tdm)[tdm$i], text = tolower(pkd.names.quotes[tdm$j])) :
  argument 'pattern' has length > 1 and only the first element will be used

I know the solution with pattern=paste(vector,collapse="|"), but my vector can be very long (all popular names).

Is there an easy vectorized version of this command, or a solution that accepts a new pattern parameter for each row?

Jacek Kotowski asked Oct 29 '22 05:10


1 Answer

You may vectorize regexpr using mapply:

mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.

Use

data.frame(
  Name      = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = mapply(regexpr, rownames(tdm)[tdm$i], tolower(pkd.names.quotes[tdm$j]), fixed=TRUE)
)

Result:

               Name Segment          Content Postion
roy             roy       3        Roy Batty       1
sebastian sebastian       7   J.F. Sebastian       6
harry         harry       8     Harry Bryant       1
dave           dave      10     Holden, Dave       9
leon           leon      11    Leon Kowalski       1
tyrell       tyrell      12 Dr. Eldon Tyrell      11
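
The element-wise pairing that mapply performs can be seen in isolation with a minimal base R sketch. The patterns and texts vectors below are toy data invented for illustration, and passing fixed = TRUE through MoreArgs is just one way to supply an argument that should be the same for every call:

```r
# Toy data (hypothetical): the i-th pattern is searched in the i-th text.
patterns <- c("roy", "dave", "tyrell")
texts    <- c("roy batty", "holden, dave", "dr. eldon tyrell")

# mapply calls regexpr(patterns[i], texts[i], fixed = TRUE) for each i.
# MoreArgs holds arguments shared by all calls; USE.NAMES = FALSE drops
# the names that mapply would otherwise copy from the first argument.
positions <- mapply(regexpr, patterns, texts,
                    MoreArgs = list(fixed = TRUE),
                    USE.NAMES = FALSE)

print(positions)
```

A value of -1 in the result would mean that the pattern was not found in its paired text.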

Alternatively, use str_locate from the stringr package, which is documented as "vectorised over string and pattern". It returns an integer matrix: the first column gives the start position of the match, and the second column gives the end position.

Use

library(stringr)
str_locate(tolower(pkd.names.quotes[tdm$j]), fixed(rownames(tdm)[tdm$i]))[,1]

Note that fixed() is needed only if the patterns should be matched as fixed (i.e. non-regex) strings. Otherwise, remove fixed() from the str_locate call and fixed=TRUE from the mapply call.
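
To illustrate why the choice matters for a name dictionary, here is a small base R sketch. The pattern "j.f." and the text "jeff" are made-up examples, not part of the data above. Read as a regex, each dot matches any single character; with fixed = TRUE the dots must appear literally:

```r
# As a regex, "." matches any character, so it matches at position 1.
regex_dot <- regexpr(".", "j.f. sebastian")
# As a fixed string, "." matches only the literal dot at position 2.
fixed_dot <- regexpr(".", "j.f. sebastian", fixed = TRUE)

# A hypothetical dictionary entry "j.f." matches "jeff" when read as a regex,
# but is not found (-1) when matched literally.
regex_name <- regexpr("j.f.", "jeff")
fixed_name <- regexpr("j.f.", "jeff", fixed = TRUE)

print(c(regex_dot, fixed_dot, regex_name, fixed_name))
```

If the dictionary holds plain names with no intended regex syntax, literal matching avoids surprises like this one.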

Wiktor Stribiżew answered Nov 15 '22 06:11