I am new to text mining, R and the tidy approach and am looking for kind advice to overcome a hurdle with pre-processing text strings read in from pdf files. The specific problem is with a multiple string replacement over multiple strings.
I have data from 2 sources:
My aim is to amend the current character strings in my main data frame, replacing strings which match the professional words in target_vocab with the associated compound token in replace_token prior to tokenization.
String example - before and after string substitution:
It is hopefully clear that I want "social workers", "early help", "multi-agency", "child in need" and "social worker" replaced with compound tokens.
My code:
#a bank of pdf reports and "professional_words.csv" in current working directory
library(tidyverse)
library(pdftools)
#> Using poppler version 0.73.0
library(tidytext)
library(stringr)
pdf_filenames <- list.files(pattern = "pdf$")
words_df <- read_csv("professional_words.csv", skip = 1, col_names = c("target_vocab", "replace_token"))
pattern_vector <- words_df$target_vocab
replace_vector <- words_df$replace_token
pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
                         mutate(filename = .x, pagenumber = row_number()) %>%
                         mutate(page_string = str_replace_all(page_string, pattern_vector, replace_vector)))
The bit that doesn't work within the map function is:
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
I have tried all sorts of variations, including gsub, breaking it away from the pipe to a separate map function etc. but with my limited knowledge I am not fixing it.
I have consistently had the warning:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length
With this variation of code I am also getting the error:
Problem with `mutate()` input `page_string`.
x Input `page_string` can't be recycled to size 10.
ℹ Input `page_string` is `str_replace_all(page_string, pattern = pattern_vector, replacement = replace_vector)`.
ℹ Input `page_string` must be size 10 or 1, not 77.
My sense is that map or list functions will help me but I seem to be going round in circles and I haven't yet found a Stack Overflow response that has helped me fix the problem.
There is a way to do what you want with str_replace_all from stringr. Instead of providing a separate pattern and replacement, pass a single named vector to pattern, where each name is the text to find and each value is its replacement. For example: pattern = c("social worker" = "social_worker", "early help" = "early_help", "multi agency" = "multi_agency"). I'll start with a simple example, and then show you how to have R build that named vector from your words_df.
# Simple example
library(stringr)
string <- "The quick brown fox"
str_replace_all(string, pattern = c("brown" = "green", "fox" = "badger"))
[1] "The quick green badger"
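For contrast, here is a small sketch of what seems to be going wrong in the question. When pattern and replacement are two separate unnamed vectors, str_replace_all pairs them up element by element with the input strings instead of applying every replacement to every string, so the result can come back longer than the input. That would explain why mutate() complained about receiving 77 values for 10 rows. The names paired and named are only for illustration.

```r
library(stringr)

# Unnamed vectors: string, pattern and replacement are recycled against
# each other, so one input string yields one result per pattern.
paired <- str_replace_all("The quick brown fox",
                          pattern = c("brown", "fox"),
                          replacement = c("green", "badger"))
length(paired)  # 2 results for 1 input string

# Named vector: every replacement is applied to the same string,
# so the result has the same length as the input.
named <- str_replace_all("The quick brown fox",
                         pattern = c("brown" = "green", "fox" = "badger"))
length(named)   # 1 result for 1 input string
```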
Here is how you do it with some fake data that looks like yours, having R build the named replacement vector.
# Making the fake data
# (stringsAsFactors = FALSE keeps the columns as character in R versions before 4.0)
words_df <- data.frame(target = c("fox", "brown", "quick"),
                       replacement = c("badger", "green", "versatile"),
                       stringsAsFactors = FALSE)
strings_df <- data.frame(page_string = c("The quick brown fox",
                                         "The sad yellow fox",
                                         "The quick old dog",
                                         "The lazy brown dog",
                                         "The quick happy fox"),
                         stringsAsFactors = FALSE)
# Making the named replacement vector from words_df
replacements <- c(words_df$replacement)
names(replacements) <- c(words_df$target)
# Doing the replacement
library(dplyr)
strings_df %>%
  mutate(new_string = str_replace_all(page_string,
                                      pattern = replacements))
# The output
          page_string                 new_string
1 The quick brown fox The versatile green badger
2  The sad yellow fox      The sad yellow badger
3   The quick old dog      The versatile old dog
4  The lazy brown dog         The lazy green dog
5 The quick happy fox The versatile happy badger
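Applied to the column names from the question, the same idea might look like the sketch below. The two tibbles are made-up stand-ins for professional_words.csv and for the pages returned by pdf_text(), so the file name, page text and word list here are assumptions rather than your real data.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for professional_words.csv
words_df <- tibble(
  target_vocab  = c("social workers", "social worker", "early help", "child in need"),
  replace_token = c("social_workers", "social_worker", "early_help", "child_in_need")
)

# Hypothetical stand-in for the pages read from the pdf files
pdf_pages_df <- tibble(
  filename    = "report.pdf",
  pagenumber  = 1:2,
  page_string = c("The social workers offered early help.",
                  "A social worker assessed the child in need.")
)

# Named vector: names are the text to find, values are the compound tokens
replacements <- setNames(words_df$replace_token, words_df$target_vocab)

result <- pdf_pages_df %>%
  mutate(page_string = str_replace_all(page_string, replacements))

result$page_string
```

In your own pipeline the mutate() line should be able to go inside the map_df() call in place of the one that fails. Two things to keep in mind: the names are treated as regular expressions, and as far as I can tell the replacements are applied one after another, so it is probably safest to list longer phrases before shorter ones that overlap with them.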