I would like to find the positions of dictionary terms within a set of short texts. The problem is in the last lines of the following code, which is roughly based on the question "From of list of strings, identify which are human names and which are not":
library(tm)
pkd.names.quotes <- c(
"Mr. Rick Deckard",
"Do Androids Dream of Electric Sheep",
"Roy Batty",
"How much is an electric ostrich?",
"My schedule for today lists a six-hour self-accusatory depression.",
"Upon him the contempt of three planets descended.",
"J.F. Sebastian",
"Harry Bryant",
"goat class",
"Holden, Dave",
"Leon Kowalski",
"Dr. Eldon Tyrell"
)
firstnames <- c("Sebastian", "Dave", "Roy",
"Harry", "Dave", "Leon",
"Tyrell")
dict <- sort(unique(tolower(firstnames)))
corp <- VCorpus(VectorSource(pkd.names.quotes))
# Strangely, Corpus() gives wrong segment numbers for the matches.
tdm <-
TermDocumentMatrix(corp, control = list(tolower = TRUE, dictionary = dict))
inspect(corp)
inspect(tdm)
View(as.matrix(tdm))
data.frame(
Name = rownames(tdm)[tdm$i],
Segment = colnames(tdm)[tdm$j],
Content = pkd.names.quotes[tdm$j],
Postion = regexpr(
pattern = rownames(tdm)[tdm$i],
text = tolower(pkd.names.quotes[tdm$j])
)
)
The output comes with a warning, and only the first row is correct:
Name Segment Content Postion
1 roy 3 Roy Batty 1
2 sebastian 7 J.F. Sebastian -1
3 harry 8 Harry Bryant -1
4 dave 10 Holden, Dave -1
5 leon 11 Leon Kowalski -1
6 tyrell 12 Dr. Eldon Tyrell -1
Warning message:
In regexpr(pattern = rownames(tdm)[tdm$i], text = tolower(pkd.names.quotes[tdm$j])) :
argument 'pattern' has length > 1 and only the first element will be used
I know the solution with pattern=paste(vector,collapse="|"), but my vector can be very long (all popular names).
Is there an easy vectorized version of this command, or a solution that accepts a new pattern parameter for each row?
You may vectorize regexpr using mapply. From its documentation:
mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.
Use
data.frame(
Name = rownames(tdm)[tdm$i],
Segment = colnames(tdm)[tdm$j],
Content = pkd.names.quotes[tdm$j],
Postion = mapply(regexpr, rownames(tdm)[tdm$i], tolower(pkd.names.quotes[tdm$j]), fixed=TRUE)
)
Result:
Name Segment Content Postion
roy roy 3 Roy Batty 1
sebastian sebastian 7 J.F. Sebastian 6
harry harry 8 Harry Bryant 1
dave dave 10 Holden, Dave 9
leon leon 11 Leon Kowalski 1
tyrell tyrell 12 Dr. Eldon Tyrell 11
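The pairwise behaviour can also be checked without tm. The following is a minimal sketch, assuming two hypothetical vectors, terms and texts, that stand in for rownames(tdm)[tdm$i] and tolower(pkd.names.quotes[tdm$j]). It passes fixed = TRUE through MoreArgs, which is one way to mark it as a constant argument rather than one to iterate over, and sets USE.NAMES = FALSE so the terms do not become row names.

```r
# Hypothetical stand-ins for rownames(tdm)[tdm$i] and
# tolower(pkd.names.quotes[tdm$j]): one term per text, in matching order.
terms <- c("roy", "sebastian", "harry", "dave", "leon", "tyrell")
texts <- tolower(c("Roy Batty", "J.F. Sebastian", "Harry Bryant",
                   "Holden, Dave", "Leon Kowalski", "Dr. Eldon Tyrell"))

# mapply calls regexpr(terms[k], texts[k], fixed = TRUE) for each k.
# MoreArgs holds the arguments that are the same in every call.
positions <- mapply(regexpr, terms, texts,
                    MoreArgs = list(fixed = TRUE),
                    USE.NAMES = FALSE)

# Each row now carries the position found with its own pattern.
data.frame(Name = terms, Content = texts, Position = positions)
```

A position of -1 would still indicate that the term was not found in the paired text.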
Alternatively, use str_locate from the stringr package, which is:
Vectorised over string and pattern
It returns:
For str_locate, an integer matrix. First column gives start position of match, and second column gives end position.
Use
str_locate(tolower(pkd.names.quotes[tdm$j]), fixed(rownames(tdm)[tdm$i]))[,1]
Note that fixed() is used when the patterns should be matched as fixed strings (i.e. not as regular expressions). Otherwise, remove fixed() and fixed=TRUE.
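To illustrate why the fixed setting matters, here is a small sketch using base R's regexpr on a made-up string (it is not taken from the data above). The same distinction should apply to str_locate with and without fixed().

```r
# Illustrative string only; it does not come from pkd.names.quotes.
text <- "jxf. j.f"

# As a regular expression, "." matches any character, so "j.f" matches "jxf".
regex_pos <- as.integer(regexpr("j.f", text))

# With fixed = TRUE the pattern is taken literally, so only "j.f" matches.
fixed_pos <- as.integer(regexpr("j.f", text, fixed = TRUE))

c(regex = regex_pos, fixed = fixed_pos)
```

For a dictionary of plain names the two modes give the same positions, but fixed matching avoids surprises if a term ever contains a character such as "." that has a special meaning in regular expressions.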