Replace words in an unstructured text file using a for loop

Question

I have a VERY unstructured text file that I read with readLines. I want to change certain strings to another string which is in a variable (called "new" below).

Below, I want the manipulated text to contain each of the terms "one", "two", "three" and "four" exactly once, in place of the "change" strings. However, as you can see, sub changes the first match in each element on every pass. I need the code to ignore the boundaries between the elements (the separate quoted strings) and use the replacements in order across the whole vector.

See example code and data below.

 #text to be changed
 text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
        "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change", 
        "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")

 #Variable containing input for text
 new <- c("one", "two", "three", "four")
 #For loop that I want to include 
 for (i in 1:length(new)) {

   text  <- sub(pattern = "change", replace = new[i], x = text)

 }
 text
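
For reference, a minimal sketch of what the loop above actually produces, which shows the problem: sub() is vectorised over text, so each pass replaces the first remaining match in every element at once. The name result is introduced here only for illustration.

```r
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

result <- text
for (i in seq_along(new)) {
  # sub() replaces the first remaining "change" in EVERY element on each pass
  result <- sub(pattern = "change", replacement = new[i], x = result)
}
result
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
#> [2] "TEXT TEXT TEXT one TEXT TEXT TEXT TEXT TEXT two"
#> [3] "TEXT TEXT TEXT one TEXT TEXT TEXT TEXT"
```

"one" is used three times, while "three" and "four" are never used.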

Roman Luštrik · Accepted Answer

How about this? The logic is to hammer away at each string until it contains no more "change". On every "hit" (where "change" is found), move one step along the new vector.

text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change", 
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")

#Variable containing input for text
new <- c("one", "two", "three", "four")
new.i <- 1

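for (i in seq_along(text)) {
  # keep replacing the first remaining "change" in this element,
  # advancing through `new` by one on every hit
  while (grepl(pattern = "change", text[i])) {
    text[i] <- sub(pattern = "change", replacement = new[new.i], x = text[i])
    new.i <- new.i + 1
  }
}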
text

[1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
[2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
[3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
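
The loop assumes that new holds at least as many elements as there are matches. If that might not be the case, the same idea could be wrapped in a small function with a guard, roughly as sketched below. The name replace_in_order is hypothetical (it is not part of base R), and fixed = TRUE is an added assumption that the pattern is a literal string rather than a regular expression.

```r
# replace_in_order() is a hypothetical helper name, not part of base R
replace_in_order <- function(text, pattern, new) {
  new.i <- 1
  for (i in seq_along(text)) {
    # stop consuming once `new` is exhausted; leftover matches stay as they are
    while (new.i <= length(new) && grepl(pattern, text[i], fixed = TRUE)) {
      text[i] <- sub(pattern, new[new.i], text[i], fixed = TRUE)
      new.i <- new.i + 1
    }
  }
  text
}

replace_in_order(c("a change", "change b change"), "change", c("one", "two"))
# returns c("a one", "two b change")
```

With only two replacements for three matches, the last "change" is left untouched.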

duckmayr · Answer

Here is another solution using gregexpr() and regmatches():

#text to be changed
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")

#Variable containing input for text
new <- c("one", "two", "three", "four")

# Alter the structure of text
altered_text <- paste(text, collapse = "\n")

# So we can use gregexpr and regmatches to get what you want
matches <- gregexpr("change", altered_text)
regmatches(altered_text, matches) <- list(new)

# And here's the result
cat(altered_text)
#> TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one
#> TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three
#> TEXT TEXT TEXT four TEXT TEXT TEXT TEXT

# Or, putting the text back to its old structure
# (one element for each line)
unlist(strsplit(altered_text, "\n"))
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
#> [2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
#> [3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"

We can do this since gregexpr() can find all the matches in the text for "change"; from help("gregexpr"):

regexpr returns an integer vector of the same length as text giving the starting position of the first match....

gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.

(emphasis added).
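
To make the quoted behaviour concrete, here is a small sketch of what gregexpr() returns for a toy vector. The names x, m, starts and n_matches are made up for this illustration, and fixed = TRUE is an added assumption that the pattern is a literal string.

```r
x <- c("a change b change", "no match here", "change")
m <- gregexpr("change", x, fixed = TRUE)

# One list element per input string; -1 marks "no match"
starts <- lapply(m, as.integer)
# starts[[1]] is c(3L, 12L), starts[[2]] is -1L, starts[[3]] is 1L

# Number of matches per element
n_matches <- vapply(m, function(p) sum(p > 0), integer(1))
# n_matches is c(2L, 0L, 1L)
```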

Then regmatches() can be used to either extract the matches found by gregexpr() or replace them; from help("regmatches"):

Usage

regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value

...

value
an object with suitable replacement values for the matched or non-matched substrings (see Details).

...

Details

The replacement function can be used for replacing the matched or non-matched substrings. For vector match data, if invert is FALSE, value should be a character vector with length the number of matched elements in m. Otherwise, it should be a list of character vectors with the same length as m, each as long as the number of replacements needed.
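
Following the Details quoted above, the collapse step could probably be avoided by passing a list with one character vector per element of text. The sketch below is one way to build that list. It assumes the number of replacements equals the total number of matches, and the names n, groups, value and replaced are introduced only for this example.

```r
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

m <- gregexpr("change", text, fixed = TRUE)

# Matches per element (gregexpr() reports -1 where there is no match)
n <- vapply(m, function(p) sum(p > 0), integer(1))
stopifnot(sum(n) == length(new))  # assumption: one replacement per match

# Split `new` into one chunk per element of `text`, sized by its match count
groups <- factor(rep(seq_along(text), n), levels = seq_along(text))
value <- unname(split(new, groups))

replaced <- text
regmatches(replaced, m) <- value
replaced
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one"
#> [2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
#> [3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
```

This keeps text as a vector throughout, so it does not depend on the lines being free of newline characters.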

Jaap · Answer

Another approach using strsplit:

tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))

ix <- df$values == "change"
df[ix, "values"] <- new
tapply(df$values, df$ind, paste, collapse = " ")

which gives:

                                                  1 
 "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
                                                  2 
"TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three" 
                                                  3 
          "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"

Additionally you could wrap the tapply call in unname:

 unname(tapply(df$values, df$ind, paste, collapse = " "))

which gives:

[1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
[2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
[3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
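
To see why this works, it may help to look at the long-format data frame that stack() builds: one row per word, with ind recording which element of the input the word came from. The toy input below is made up for illustration and stands in for the split text.

```r
tl <- list(c("a", "change"), c("change", "b"))
df <- stack(setNames(tl, seq_along(tl)))

df$values             # "a" "change" "change" "b"
as.character(df$ind)  # "1" "1" "2" "2"

# Replacements are assigned in row order, i.e. in reading order across elements
df[df$values == "change", "values"] <- c("one", "two")
out <- unname(tapply(df$values, df$ind, paste, collapse = " "))
# out contains "a one" and "two b"
```

Because the comparison is an exact match on space-separated words, a token such as "change." with trailing punctuation would not be picked up by this approach.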

If new has fewer elements than there are occurrences of the pattern, and you want to use each element of new only once, you could update the code to:

newnew <- new[1:3]

ix <- df$values == "change"
df[ix, "values"][1:length(newnew)] <- newnew
unname(tapply(df$values, df$ind, paste, collapse = " "))

You could alter this further to also handle the situation where there are more replacements than positions (occurrences of the pattern, "change" in the example) that need to be replaced:

newnew2 <- c(new, "five")

tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))

ix <- df$values == "change"
# use only as many replacements as there are matches (and vice versa)
n <- min(sum(ix), length(newnew2))
df[ix, "values"][seq_len(n)] <- newnew2[seq_len(n)]
unname(tapply(df$values, df$ind, paste, collapse = " "))

Tags: loops, for-loop, r

Gorp
