
How to subset text from a word docx AFTER a matching phrase

Tags: r, ms-word, officer

I would like to subset the text of an original Word document ("original.docx") into a new Word document ("desired.docx"), keeping only the text from the matching phrase "Drop Text Before Here" onward, while preserving the original formatting of the retained text.

I have modified the example from the {officer} package documentation for body_remove() to show the original and desired results (in docx form). The difference is that the example in the documentation keeps the portion of text before, and I would like to keep the text after the matched phrase.

library(officer)

# Original text
str1 <- rep("Lorem ipsum dolor sit amet, consectetur adipiscing elit. ", 3)
str1 <- paste(str1, collapse = "")

str2 <- "Drop Text Before Here"

str3 <- rep("Aenean venenatis varius elit et fermentum vivamus vehicula. ", 3)
str3 <- paste(str3, collapse = "")

# Create original_docx prior to subset
original_docx <- read_docx()
original_docx <- body_add_par(original_docx, value = str1, style = "Normal")
original_docx <- body_add_par(original_docx, value = str2, style = "centered")
original_docx <- body_add_par(original_docx, value = str3, style = "Normal")

# Save original docx in local directory
print(original_docx, "original.docx")

# Desired docx after subset starting at "Drop Text Before Here"
desired_docx <- read_docx()
desired_docx <- body_add_par(desired_docx, value = str2, style = "centered")
desired_docx <- body_add_par(desired_docx, value = str3, style = "Normal")

# Save desired docx in local directory
print(desired_docx, "desired.docx")

Created on 2022-04-09 by the reprex package (v2.0.1)

David Lucey asked Oct 22 '25 10:10



1 Answer

You could use a custom recursive function that steps backward through the document from the current cursor position, removing the body element at each step, and stops on the error that signals the beginning of the document.

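# Recursively remove every body element that precedes the cursor.
body_remove_before_cursor <- function(x) {
  tryCatch(
    {
      # Step back one element and remove it, then repeat
      x <- officer::cursor_backward(x)
      x <- officer::body_remove(x)
      body_remove_before_cursor(x)
    },
    # An error here is taken to mean the beginning of the document
    # was reached, so return the document as it stands
    error = function(e) {
      x
    }
  )
}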

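library(officer)

# str2 is the phrase defined in the question ("Drop Text Before Here")
desired_2_docx <- read_docx("original.docx")
# Place the cursor on the paragraph containing the phrase
desired_2_docx <- cursor_reach(desired_2_docx, str2)
desired_2_docx <- body_remove_before_cursor(desired_2_docx)
print(desired_2_docx, "desired_2.docx")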
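
To check which paragraphs ought to survive the subset, one option is to list the body elements with docx_summary() and locate the matching phrase. The sketch below is only an illustration, not part of the approach above: it builds its own shortened three-paragraph document in a temporary file, and the names phrase, summ, hit and kept are made up for the example.

```r
library(officer)

phrase <- "Drop Text Before Here"

# Build a small three-paragraph document similar to the one in the question
doc <- read_docx()
doc <- body_add_par(doc, value = "Lorem ipsum dolor sit amet.", style = "Normal")
doc <- body_add_par(doc, value = phrase, style = "centered")
doc <- body_add_par(doc, value = "Aenean venenatis varius elit.", style = "Normal")
path <- print(doc, target = tempfile(fileext = ".docx"))

# One row per body element, in document order
summ <- docx_summary(read_docx(path))

# Row(s) whose text contains the phrase (literal match, not a regex)
hit <- which(grepl(phrase, summ$text, fixed = TRUE))

# Elements expected to remain: the first match and everything after it
kept <- summ[seq(hit[1], nrow(summ)), c("doc_index", "style_name", "text")]
print(kept)
```

Running the same docx_summary() call on "desired_2.docx" and comparing its text and style_name columns with kept is one way to confirm that the text before the phrase is gone and the styles were retained.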
the-mad-statter answered Oct 25 '25 02:10
