R, stringr, mutate (I think) - multiple partial string replacements in multiple strings

I am new to text mining, R and the tidy approach and am looking for kind advice to overcome a hurdle with pre-processing text strings read in from pdf files. The specific problem is with a multiple string replacement over multiple strings.

I have data from 2 sources:

  1. PDF reports: I used map and pdf_text to read a directory of PDF reports into a tibble with 3 columns: page_string, filename and pagenumber. There are 1,191 rows, and each page_string holds the text of one PDF page.
  2. CSV file of professional words and replacements: I imported this with read_csv. The resulting data frame has 2 columns and 77 rows: target_vocab (e.g. social worker) and replace_token (e.g. social_worker).

My aim is to amend the current character strings in my main data frame, replacing strings which match the professional words in target_vocab with the associated compound token in replace_token prior to tokenization.

String example - before and after string substitution:

  1. "Social workers and early help staff work with multi-agency partners to produce child in need plans led by the allocated social worker".
  2. "Social_workers and early_help staff work with multi_agency partners to produce CIN plans led by the allocated social_worker".

It is hopefully clear that I want "social workers", "early help", "multi-agency", "child in need" and "social worker" replaced with compound tokens.

My code:

#a bank of pdf reports and "professional_words.csv" in current working directory

library(tidyverse)
library(pdftools)
#> Using poppler version 0.73.0
library(tidytext)
library(stringr)

pdf_filenames <- list.files(pattern = "pdf$")

words_df <- read_csv("professional_words.csv", skip = 1, col_names = c("target_vocab", "replace_token"))

pattern_vector <- words_df$target_vocab
replace_vector <- words_df$replace_token

pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
         mutate(filename = .x, pagenumber = row_number()) %>%
           mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector))) 

The bit that doesn't work within the map function is:

mutate(page_string = str_replace_all(page_string, pattern_vector, replace_vector))

I have tried all sorts of variations, including gsub and moving the replacement out of the pipe into a separate map call, but with my limited knowledge I have not been able to fix it.

I have consistently had the warning:

In stri_replace_all_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length

With this variation of code I am also getting the error:

Problem with mutate() input page_string.
x Input page_string can't be recycled to size 10.
ℹ Input page_string is str_replace_all(page_string, pattern = pattern_vector, replacement = replace_vector).
ℹ Input page_string must be size 10 or 1, not 77.

My sense is that map or list functions will help me but I seem to be going round in circles and I haven't yet found a Stack Overflow response that has helped me fix the problem.

Charlotte Waits asked May 12 '26 07:05


1 Answer

You can do this with str_replace_all from stringr. Instead of supplying separate pattern and replacement arguments, pass a single named vector to pattern, where each name is the text to find and each value is its replacement, for example pattern = c("social worker" = "social_worker", "early help" = "early_help", "multi agency" = "multi_agency"). I'll start with a simple example, and then show how to have R build that named vector from your words_df.

# Simple example
library(stringr)
string <- "The quick brown fox"
str_replace_all(string, pattern = c("brown" = "green", "fox" = "badger"))
[1] "The quick green badger"
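
For contrast, here is a small sketch of why the unnamed form in the question misbehaves. With separate pattern and replacement vectors, stringr pairs them element by element with the input strings instead of applying every replacement to every string, which is where the length warning comes from. The vector x is made up for illustration.

```r
library(stringr)

# Made-up input for illustration
x <- c("fox", "brown fox")

# Unnamed vectors: pattern[i] and replacement[i] are paired with x[i],
# so "fox" in the second string is left untouched
paired <- str_replace_all(x, c("fox", "brown"), c("badger", "green"))

# Named vector: every name/value pair is applied to every string
all_pairs <- str_replace_all(x, c("fox" = "badger", "brown" = "green"))

print(paired)     # "badger"  "green fox"
print(all_pairs)  # "badger"  "green badger"
```

In the question, 77 patterns are being paired against 10 page strings, so the lengths cannot line up.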

Here is how you do it with some fake data that looks like yours, having R build the named replacement vector.

# Making the fake data
words_df <- data.frame(target = c("fox", "brown", "quick"),
                       replacement = c("badger", "green", "versatile"))

strings_df <- data.frame(page_string = c("The quick brown fox",
                                         "The sad yellow fox",
                                         "The quick old dog",
                                         "The lazy brown dog",
                                         "The quick happy fox"))

# Making the named replacement vector from words_df:
# names are the text to find, values are the replacements
replacements <- setNames(words_df$replacement, words_df$target)

# Doing the replacement
library(dplyr)
strings_df %>% 
  mutate(new_string = str_replace_all(page_string, 
                                      pattern = replacements))

# The output
          page_string                 new_string
1 The quick brown fox The versatile green badger
2  The sad yellow fox      The sad yellow badger
3   The quick old dog      The versatile old dog
4  The lazy brown dog         The lazy green dog
5 The quick happy fox The versatile happy badger
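
Applied to the column names from the question, the same idea might look like the sketch below. Both data frames are small hypothetical stand-ins for professional_words.csv and for the pages returned by pdf_text. The (?i) prefix (an inline regex flag for case-insensitive matching) and the sort by pattern length are optional additions, not something the question asks for.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for professional_words.csv (same column names as the question)
words_df <- data.frame(
  target_vocab  = c("social worker", "early help", "multi-agency", "child in need"),
  replace_token = c("social_worker", "early_help", "multi_agency", "CIN"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the tibble built from pdf_text()
pdf_pages_df <- data.frame(
  page_string = paste("Social workers and early help staff work with",
                      "multi-agency partners to produce child in need plans",
                      "led by the allocated social worker"),
  stringsAsFactors = FALSE
)

# Replacements run one after another, so put longer phrases first
# to stop a shorter phrase from consuming part of a longer one
words_df <- words_df[order(-nchar(words_df$target_vocab)), ]

# Names are regular expressions; (?i) makes each one case-insensitive
replacements <- setNames(words_df$replace_token,
                         paste0("(?i)", words_df$target_vocab))

result <- pdf_pages_df %>%
  mutate(page_string = str_replace_all(page_string, replacements))

print(result$page_string)
```

Note that the replaced text takes its case from replace_token, so "Social workers" comes out as "social_workers". Because the names are treated as regular expressions, any vocabulary entry containing characters such as . or ( would need escaping first.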
Ben Norris answered May 15 '26 04:05