I have a VERY unstructured text file that I read with readLines. I want to change certain strings to another string which is in a variable (called "new" below).
In the example below, I want each "change" to be replaced by the next term in turn, so that "one", "two", "three" and "four" each appear exactly once. However, as you can see, sub replaces the first match in every element on each pass, whereas I need the code to ignore the boundaries between the quoted strings and treat them as one continuous text.
See example code and data below.
#text to be changed
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
"TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
"TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
#Variable containing input for text
new <- c("one", "two", "three", "four")
#For loop that I want to include
for (i in seq_along(new)) {
  text <- sub(pattern = "change", replacement = new[i], x = text)
}
text
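For reference, here is a small sketch of what the loop above returns, which shows where it falls short. Because sub is vectorised, each pass replaces the first remaining "change" in every element at once. The name attempt is only used here so that the original text stays untouched.

```r
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

# Work on a copy (hypothetical name) so `text` is left as it was
attempt <- text
for (i in seq_along(new)) {
  # Each pass replaces the first remaining match in EVERY element
  attempt <- sub(pattern = "change", replacement = new[i], x = attempt)
}
attempt
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
#> [2] "TEXT TEXT TEXT one TEXT TEXT TEXT TEXT TEXT two"
#> [3] "TEXT TEXT TEXT one TEXT TEXT TEXT TEXT"
```

So "one" is used three times, and "three" and "four" are never used.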
How about this? The logic is to hammer away at each string until it has no more "change" left. On every "hit" (where "change" is found), move along the new vector.
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
"TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
"TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
#Variable containing input for text
new <- c("one", "two", "three", "four")
new.i <- 1
for (i in seq_along(text)) {
  # Keep replacing the first remaining "change" in this element
  while (grepl(pattern = "change", text[i])) {
    text[i] <- sub(pattern = "change", replacement = new[new.i], x = text[i])
    new.i <- new.i + 1  # advance to the next replacement
  }
}
text
[1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
[2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
[3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
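The loop above assumes that new holds at least as many entries as there are matches. Once the replacements run out, new[new.i] becomes NA, which is unlikely to be what you want. Below is a minimal sketch of the same idea wrapped in a function that simply stops when new is exhausted. The function name replace_sequentially and the use of fixed = TRUE (literal matching rather than a regular expression) are my own assumptions, not part of the answer above.

```r
# Hypothetical helper: replace each occurrence of `pattern` in turn
# with the next entry of `new`, stopping when `new` is used up.
replace_sequentially <- function(text, pattern, new) {
  new.i <- 1
  for (i in seq_along(text)) {
    # The first condition guards against running past the end of `new`
    while (new.i <= length(new) &&
           grepl(pattern, text[i], fixed = TRUE)) {
      text[i] <- sub(pattern, new[new.i], text[i], fixed = TRUE)
      new.i <- new.i + 1
    }
  }
  text
}

text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

res_all <- replace_sequentially(text, "change", new)
res_short <- replace_sequentially(text, "change", new[1:2])
```

With only two replacements, the later "change" strings are left as they are. One caveat: if a replacement itself contains the pattern, the loop will find it again on the next pass, so this sketch is only suitable when the replacements do not contain the pattern.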
Here is another solution using gregexpr() and regmatches():
#text to be changed
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
"TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
"TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
#Variable containing input for text
new <- c("one", "two", "three", "four")
# Alter the structure of text
altered_text <- paste(text, collapse = "\n")
# So we can use gregexpr and regmatches to get what you want
matches <- gregexpr("change", altered_text)
regmatches(altered_text, matches) <- list(new)
# And here's the result
cat(altered_text)
#> TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one
#> TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three
#> TEXT TEXT TEXT four TEXT TEXT TEXT TEXT
# Or, putting the text back to its old structure
# (one element for each line)
unlist(strsplit(altered_text, "\n"))
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
#> [2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
#> [3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
We can do this since gregexpr() can find all the matches in the text for "change"; from help("gregexpr"):
regexpr returns an integer vector of the same length as text giving the starting position of the first match....
gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
(emphasis added).
Then regmatches() can be used to either extract the matches found by gregexpr() or replace them; from help("regmatches"):
Usage
regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value...
value
an object with suitable replacement values for the matched or non-matched substrings (see Details)....
Details
The replacement function can be used for replacing the matched or non-matched substrings. For vector match data, if invert is FALSE, value should be a character vector with length the number of matched elements in m. Otherwise, it should be a list of character vectors with the same length as m, each as long as the number of replacements needed.
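Since the replacement value may be a list with one character vector per element, it should also be possible to skip the paste()/strsplit() round trip, which matters if the text itself could contain "\n". The following is a sketch under that reading of the help page. The names n_matches, groups, value and result are my own, and it assumes that new has exactly one entry per match.

```r
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

# Match data for every element of text (a list of the same length as text)
matches <- gregexpr("change", text)

# Number of matches found in each element
n_matches <- lengths(regmatches(text, matches))

# This sketch assumes one replacement per match
stopifnot(length(new) == sum(n_matches))

# Assign each replacement to the element it belongs to;
# fixing the levels keeps one group per element of text
groups <- factor(rep(seq_along(text), n_matches), levels = seq_along(text))
value <- unname(split(new, groups))

# Replace on a copy, one character vector of replacements per element
result <- text
regmatches(result, matches) <- value
result
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
#> [2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
#> [3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
```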
Another approach using strsplit:
tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))
ix <- df$values == "change"
df[ix, "values"] <- new
tapply(df$values, df$ind, paste, collapse = " ")
which gives:
1 "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
2 "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
3 "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
Additionally, you could wrap the tapply call in unname:
unname(tapply(df$values, df$ind, paste, collapse = " "))
which gives:
[1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
[2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
[3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
If you want to use the elements of new only once (for example, when there are fewer replacements than matches), you could update the code to:
newnew <- new[1:3]
ix <- df$values == "change"
df[ix, "values"][1:length(newnew)] <- newnew
unname(tapply(df$values, df$ind, paste, collapse = " "))
You could alter this further to also take into account the situation where there are more replacements than positions (occurrences of the pattern, "change" in the example) that need to be replaced:
newnew2 <- c(new, "five")
tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))
ix <- df$values == "change"
n <- min(sum(ix), length(newnew2))  # number of replacements that can actually be made
df[ix, "values"][seq_len(n)] <- newnew2[seq_len(n)]
unname(tapply(df$values, df$ind, paste, collapse = " "))
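The variants above can be folded into a single helper that copes with too few as well as too many replacements. This is only a sketch: the function name replace_tokens and its token argument are my own. Like the code above, it only matches whole space-separated tokens that are exactly equal to "change", so something like "change," would not be replaced.

```r
# Hypothetical helper combining the strsplit/stack/tapply steps
replace_tokens <- function(text, new, token = "change") {
  # Split every element into its space-separated tokens
  tl <- strsplit(text, split = " ", fixed = TRUE)
  # Long format: one row per token, `ind` records the source element
  df <- stack(setNames(tl, seq_along(tl)))
  # Row positions of the tokens to be replaced
  hits <- which(df$values == token)
  # Replace only as many tokens as there are replacements (and vice versa)
  n <- min(length(hits), length(new))
  df$values[hits[seq_len(n)]] <- new[seq_len(n)]
  # Glue the tokens back together per element;
  # as.vector() drops the names and the dim attribute left by tapply()
  as.vector(tapply(df$values, df$ind, paste, collapse = " "))
}

text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

exact <- replace_tokens(text, new)               # as many replacements as matches
surplus <- replace_tokens(text, c(new, "five"))  # "five" is simply not used
shortfall <- replace_tokens(text, new[1:3])      # the last "change" is kept
```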