
Matching and replacing strings from a list with another list in R

Tags: r

I have two character vectors and would like to search a text column, replacing each item from the first vector with the corresponding item from the second. The second vector is identical to the first, except that its items are wrapped in HTML formatting tags.

I have written a small loop that greps for each item in the first vector and tries to replace it with the matching item from the second, but it does not work. I have also tried str_replace, to no avail.

top_attribute_names<- c("Item Number \\(DPCI\\)", "UPC", "TCIN", "Product Form", "Health Facts", 
"Beauty Purpose", "Package Quantity", "Features", "Suggested Age", 
"Scent")

# Wrap every name shorter than 30 characters in bold HTML tags
top_attributes_html <- ifelse(nchar(top_attribute_names) < 30, paste0("<b>", top_attribute_names, "</b>"), top_attribute_names)

clean_free_description<-
c("Give your feathered friends a cozy new home with the Ceramic and Wood Birdhouse from Threshold. This simple birdhouse features a natural color scheme that helps it blend in with the tree you hang it from. The ceramic top is easy to remove when you want to clean out the birdhouse, while the small round hole lets birds in and keeps predators out. Sprinkle some seeds inside and watch your bird buddies become more permanent residents of your backyard.\nMaterial: Ceramic, Wood\nDimensions (Overall): 7.7 inches (H) x 8.5 inches (W) x 8.5 inches (L)\nWeight: 2.42 pounds\nAssembly Details: No assembly requiredpets subtype: Bird houses\nProtective Qualities: Weather-resistant\nMount Type: Hanging\nTCIN: 52754553\nUPC: 490840935721\nItem Number (DPCI): 084-09-3572\nOrigin: Imported\n", 
"House your parakeets in style with this Victorian-style bird cage. Featuring multiple colors and faux brickwork, the cage serves as a charming addition to your dcor. It's also equipped with two perches and feeding dishes, making it instantly functional.\nMaterial: Steel, Plastic\nDimensions (Overall): 21.5 inches (H) x 16.0 inches (W) x 16.0 inches (L)\nWeight: 15.0 pounds\nMaterial: Metal (Frame)\nIntended Pet Type: Bird\nIncludes: Feeding Dish, perch\nAssembly Details: Assembly required, no tools needed\nPets subtype: Bird cages\nBreed size: Small (0-25 pounds)\nSustainability Claims: Recyclable\nWarranty: 90 day limited warranty. To obtain a copy of the manufacturer's warranty for this item, please call Target Guest Services at 1-800-591-3869.\nWarranty Information:To obtain a copy of the manufacturer's warranty for this item, please call Target Guest Services at 1-800-591-3869.\nVictorian-style parakeet cage with 2 perches\nFeatures a molded base, a single front door and faux plastic brickwork\nMade of wire and plastic; 5/8\" spacing\nWash with soap and water18\nLx25.5\nHx18\nW\"TCIN: 10159211\nUPC: 048081002940\nItem Number (DPCI): 083-01-0167\n", 
"The Cockatiel Scalloped Top Bird Cage Kit is an ideal starter kit for cockatiels and other medium sized birds. Designer white scalloped style cage features large front door, easy to clean pull out tray, food and water dishes, wooden perches and swing. To help welcome and pamper your new bird, this starter kit also includes perch covers, kabob bird toy, cuttlebone, flavored mineral treat and a cement perch. Easy to assemble.\nMaterial: Metal\nDimensions (Overall): 27.25 inches (H) x 14.0 inches (W) x 18.25 inches (L)\nWeight: 11.0 pounds\nMaterial: Metal (Frame)\nIntended Pet Type: Bird\nPets subtype: Bird cages\nBreed size: All sizes\nTCIN: 16707833\nUPC: 030172016240\nItem Number (DPCI): 083-01-0248\n")

for(i in top_attribute_names){
  clean_free_description[grepl(i, clean_free_description)] <- top_attributes_html[i]
}

Theoretically, I thought I would also be able to use str_replace to do this:

clean_free_description<-str_replace(clean_free_description,top_attribute_names,top_attributes_html)

But that yields the warning:

In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length

I'm sure there is also a better solution that eliminates a step by matching the strings with a regex and wrapping the matches in HTML tags directly. Unfortunately, I'm not good enough at regex yet to figure that one out.

roody asked Oct 17 '22 14:10


2 Answers

You might try stringi::stri_replace_all_regex as shown below. I have not printed the full output here due to its length, but a shorter example demonstrates the basic functionality. I hope this is what you were looking for.

UPDATE: I added a benchmark of the stringi and stringr solutions, which shows why I did not stick with your original stringr code but introduced stringi here.

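# Short demo: each pattern is applied in turn to every element of the input
stringi::stri_replace_all_regex(c("a", "b", "c"), c("b", "c"), c("x", "y"), vectorize_all = FALSE)
#[1] "a" "x" "y"

# The same call on the data from the question
stringi::stri_replace_all_regex(clean_free_description, top_attribute_names, top_attributes_html, vectorize_all = FALSE)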

library(stringr)
library(stringi)

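# stringr: a named vector maps each pattern (name) to its replacement (value)
f_stringr <- function() {
  names(top_attributes_html) <- top_attribute_names
  str_replace_all(clean_free_description, top_attributes_html)
}

# stringi: vectorize_all = FALSE applies every pattern to every string
f_stringi <- function() {
  stri_replace_all_regex(clean_free_description, top_attribute_names, top_attributes_html, vectorize_all = FALSE)
}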

all.equal(f_stringr(), f_stringi())
# TRUE

microbenchmark::microbenchmark(
   f_stringr(), 
   f_stringi()
)
# Unit: microseconds
#        expr     min      lq      mean   median       uq      max neval
# f_stringr() 937.129 956.274 1041.7329 1053.579 1076.276 1296.743   100
# f_stringi() 122.767 128.491  136.6937  137.372  142.899  245.138   100
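
To make the effect of vectorize_all = FALSE more concrete, here is a minimal base R sketch of the same idea: apply each pattern/replacement pair in turn to the whole vector. The helper name replace_sequentially and the sample strings are made up for illustration. This is not how stringi works internally, only an equivalent loop for simple cases like this one.

```r
# Hypothetical helper (not part of stringi or stringr): applies each
# pattern/replacement pair, in order, to every element of x.
replace_sequentially <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (k in seq_along(patterns)) {
    # gsub replaces only the matched text and leaves the rest of each string intact
    x <- gsub(patterns[k], replacements[k], x)
  }
  x
}

# Made-up sample data for illustration
descriptions <- c("TCIN: 1\nUPC: 2\n", "UPC: 3\n")
result <- replace_sequentially(
  descriptions,
  c("UPC", "TCIN"),
  c("<b>UPC</b>", "<b>TCIN</b>")
)
print(result)
```

This is roughly what the loop in the question was aiming for. Two differences matter: the sketch indexes by position, whereas top_attributes_html[i] with a character i on an unnamed vector returns NA, and it replaces only the match instead of overwriting the whole description.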
Manuel Bickel answered Nov 15 '22 09:11


I think this should do what you're looking for:

library(stringr)
names(top_attributes_html) <- top_attribute_names
str_replace_all(clean_free_description, top_attributes_html)
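
The question also asked whether the tags could be added by the regex itself, without building a second vector. A minimal base R sketch of that idea follows, assuming every listed name should be bolded (the 30-character check from the question is omitted). The variable names attribute_names, pattern and sample_text are illustrative only.

```r
# Patterns as in the question (regex-escaped parentheses), shortened list
attribute_names <- c("Item Number \\(DPCI\\)", "UPC", "TCIN")

# One alternation wrapped in a capture group: (Item Number \(DPCI\)|UPC|TCIN)
pattern <- paste0("(", paste(attribute_names, collapse = "|"), ")")

# Made-up sample text for illustration
sample_text <- "TCIN: 1\nUPC: 2\nItem Number (DPCI): 3\n"

# \\1 re-inserts whatever the group matched, wrapped in bold tags
tagged <- gsub(pattern, "<b>\\1</b>", sample_text)
cat(tagged)
```

Because the replacement reuses the matched text, the escaped pattern "Item Number \\(DPCI\\)" never has to appear in a replacement string.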
sbha answered Nov 15 '22 08:11