This is my first question on SO so let me know if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table that contains test cases. Here, I build a much simplified example:
library(data.table)

texts.dt <- data.table(string = c("one",
                                  "two words",
                                  "three words here",
                                  "four useless words here",
                                  "five useless meaningless words here",
                                  "six useless meaningless words here just",
                                  "seven useless meaningless words here just to",
                                  "eigth useless meaningless words here just to fill",
                                  "nine useless meaningless words here just to fill up",
                                  "ten useless meaningless words here just to fill up space"),
                       word.count = 1:10,
                       stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5))
This returns the data.table we will be working on:
string word.count stop.at.word
1: one 1 0
2: two words 2 1
3: three words here 3 2
4: four useless words here 4 2
5: five useless meaningless words here 5 4
6: six useless meaningless words here just 6 3
7: seven useless meaningless words here just to 7 3
8: eigth useless meaningless words here just to fill 8 6
9: nine useless meaningless words here just to fill up 9 7
10: ten useless meaningless words here just to fill up space 10 5
In the real application, values in the stop.at.word column are determined at random (with an upper bound of word.count - 1). Also, the strings are not ordered by length, but that should not make a difference.
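For illustration only, here is a minimal sketch of how such values could be drawn. It assumes a uniform draw between 0 and word.count - 1; the sampling used in the real application is not shown here, so the distribution and the seed are assumptions.

```r
set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
word.count <- 1:10

# sample.int(n, 1) draws one integer from 1..n; subtracting 1 shifts it to 0..(n - 1)
stop.at.word <- vapply(word.count, function(n) sample.int(n, 1) - 1L, integer(1))

# every value respects the upper bound word.count - 1
all(stop.at.word >= 0 & stop.at.word <= word.count - 1)
```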
The code should add two columns, input and output, where input contains the substring from position 1 up to stop.at.word and output contains the word that follows (single word), like so:
>desired_result
string word.count stop.at.word input
1: one 1 0
2: two words 2 1 two
3: three words here 3 2 three words
4: four useless words here 4 2 four useless
5: five useless meaningless words here 5 4 five useless meaningless words
6: six useless meaningless words here just 6 3 six useless meaningless
7: seven useless meaningless words here just to 7 3 seven useless meaningless
8: eigth useless meaningless words here just to fill 8 6 eigth useless meaningless words here just
9: nine useless meaningless words here just to fill up 9 7 nine useless meaningless words here just to
10: ten useless meaningless words here just to fill up space 10 5 ten useless meaningless words here
output
1:
2: words
3: here
4: words
5: here
6: words
7: words
8: to
9: fill
10: just
Unfortunately what I get instead is this:
string word.count stop.at.word input output
1: one 1 0
2: two words 2 1 NA NA
3: three words here 3 2 NA NA
4: four useless words here 4 2 NA NA
5: five useless meaningless words here 5 4 NA NA
6: six useless meaningless words here just 6 3 NA NA
7: seven useless meaningless words here just to 7 3 NA NA
8: eigth useless meaningless words here just to fill 8 6 NA NA
9: nine useless meaningless words here just to fill up 9 7 NA NA
10: ten useless meaningless words here just to fill up space 10 5 ten NA
Notice the inconsistent results, with an empty string on row 1 and "ten" returned on row 10.
Here is the code I am using:
texts.dt[, c("input", "output") := .(
  substr(string,
         1,
         sapply(gregexpr(" ", string), "[", stop.at.word) - 1),
  substr(string,
         sapply(gregexpr(" ", string), "[", stop.at.word),
         sapply(gregexpr(" ", string), "[", stop.at.word + 1) - 1)
)]
I ran many tests and the substr instructions work well when I try individual strings in the console, but fail when applied to the data.table.
I suspect I am missing something related to scoping within data.table, but I haven't been using this package for long so I am quite confused.
I would greatly appreciate some assistance. Thanks in advance!
I would probably do
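texts.dt[stop.at.word > 0, c("input","output") := {
  # split each string into its words once
  sp = strsplit(string, " ")
  list(
    # input: the first stop.at.word words of each row, pasted back together
    mapply(function(p,n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
    # output: the single word right after them
    mapply(`[`, sp, stop.at.word+1L)
  )
}]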
# partial result
head(texts.dt, 4)
string word.count stop.at.word input output
1: one 1 0 NA NA
2: two words 2 1 two words
3: three words here 3 2 three words here
4: four useless words here 4 2 four useless words
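The key difference from the code in the question is that mapply walks the strings and stop.at.word in parallel, whereas sapply(x, "[", stop.at.word) hands the whole stop.at.word column to every `[` call. That appears to be where the NA values come from: the result is a matrix rather than one position per row, and substr only uses its first few entries. A small base R sketch of the difference, using made-up vectors strings and stop_at (names chosen for illustration, not taken from the question):

```r
strings <- c("two words", "three words here", "four useless words here")
stop_at <- c(1, 2, 2)
sp <- strsplit(strings, " ")

# sapply() passes the WHOLE vector stop_at + 1 to every `[` call, so each
# string is indexed with c(2, 3, 3) and the result is a 3 x 3 matrix
wrong <- sapply(sp, "[", stop_at + 1)
dim(wrong)
# 3 3

# mapply() pairs element i of sp with element i of stop_at
next_word <- mapply(function(p, n) p[n + 1], sp, stop_at)
next_word
# "words" "here" "words"

first_n <- mapply(function(p, n) paste(p[seq_len(n)], collapse = " "), sp, stop_at)
first_n
# "two" "three words" "four useless"
```

The same pairing could be applied to the gregexpr positions from the question, but splitting into words first avoids the position arithmetic.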
Alternately:
library(stringi)
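texts.dt[stop.at.word > 0, c("input","output") := {
  # group 1: the first stop.at.word words; group 3: the single word that follows
  patt = paste0("((\\w+ ){", stop.at.word-1, "}\\w+) (\\w+)")
  m = stri_match(string, regex = patt)
  # column 1 of m is the full match, so groups 1 and 3 sit in columns 2 and 4
  list(m[, 2], m[, 4])
}]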
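To see how the pattern is put together, here is a sketch for a single string in base R, without stringi. The helper name split_at_word is hypothetical and used only for this illustration; it is shown for n of 2 or more. Note that a final group of (\\w+) captures only one word, whereas (.*) would capture the rest of the string.

```r
# Hypothetical helper: split one string after its n-th word using a regex
split_at_word <- function(s, n) {
  # n - 1 repetitions of "word + space", one more word, a space, then the next word
  patt <- paste0("^((\\w+ ){", n - 1, "}\\w+) (\\w+)")
  m <- regmatches(s, regexec(patt, s))[[1]]
  # m[1] is the full match, m[2] the first n words, m[4] the word that follows
  c(input = m[2], output = m[4])
}

split_at_word("four useless words here", 2)
#          input         output
# "four useless"        "words"
```

As with the stringi version, \\w only matches letters, digits and underscores, so words containing apostrophes or hyphens would need a different character class.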