
Use regex to replace duplicate phrases

Tags: string, regex, r

I need to parse large data files and for reasons unknown the addresses are sometimes repeated, like this:

library(data.table)
d <- data.table(name = c("bill", "tom"), address = c("35 Valerie Avenue / 35 Valerie Avenue", "702 / 9 Paddock Street / 702 / 9 Paddock Street"))

I have figured out how to de-dupe the easy ones (e.g. "35 Valerie Avenue / 35 Valerie Avenue") with the following:

replace.dupe.addresses <- function(x) {
  rep_expr <- "^(.*)/(.*)$"
  idx <- grepl("/", x) & (trimws(sub(rep_expr, "\\2", x)) == trimws(sub(rep_expr, "\\1", x)))
  x[idx] <- trimws(sub(rep_expr, "\\1", x[idx]))
  x
}

d[, address := replace.dupe.addresses(address)]

But this doesn't work for addresses where the critical '/' is embedded further into the string. I tried rep_expr <- "^(.*)[:alpha:][:space:]?/(.*)$" as my regex, but that doesn't work either. What regular expression would capture both of these repeating phrases?
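For context on why both attempts fail: the greedy (.*) in "^(.*)/(.*)$" always splits at the last '/', so the second address is cut into "702 / 9 Paddock Street / 702 " and " 9 Paddock Street", which never compare equal. In the second pattern, [:alpha:] written outside a bracket expression is read as a plain character set rather than a POSIX class; the class form is [[:alpha:]]. One possible direction, sketched below, is to let a backreference do the comparison. The helper name dedupe_repeated_phrase is hypothetical, and the sketch assumes the whole string is exactly one phrase repeated twice around a single separating '/'.

```r
# Hypothetical helper (not from the original post): collapse "X / X" to "X".
dedupe_repeated_phrase <- function(x) {
  # ^(.+?)     lazily captures the first copy of the phrase
  # \\s*/\\s*  matches the separating slash with optional whitespace
  # \\1$       requires the same phrase again, up to the end of the string
  # trimws() strips outer whitespace first, so every element comes back trimmed
  sub("^(.+?)\\s*/\\s*\\1$", "\\1", trimws(x), perl = TRUE)
}

addresses <- c("35 Valerie Avenue / 35 Valerie Avenue",
               "702 / 9 Paddock Street / 702 / 9 Paddock Street",
               "12 / 4 High Street")
dedupe_repeated_phrase(addresses)
# [1] "35 Valerie Avenue"      "702 / 9 Paddock Street" "12 / 4 High Street"
```

Strings that are not an exact two-fold repeat, such as the third one, are returned unchanged (apart from trimming), so the helper can be applied to the whole column, e.g. d[, address := dedupe_repeated_phrase(address)].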

matto asked Mar 16 '26 15:03


2 Answers

See if this works for your dataset:

library(data.table)

d[, .(name, address = lapply(strsplit(address, " / "), function(x) 
  paste(x[!duplicated(x)], collapse=" / "))), by=.I]
   name                address
1: bill      35 Valerie Avenue
2:  tom 702 / 9 Paddock Street
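
The same split-and-deduplicate idea can also be written as a small vectorised function in base R, which makes it easy to check on a plain character vector before touching the table. This is only a sketch: the name dedupe_parts and its sep argument are made up for illustration, and it assumes the parts are always separated by exactly " / ".

```r
# Hypothetical helper: split on the separator, drop repeated parts, re-join.
dedupe_parts <- function(x, sep = " / ") {
  vapply(
    strsplit(x, sep, fixed = TRUE),                       # one vector of parts per address
    function(parts) paste(unique(parts), collapse = sep), # keep first occurrence of each part
    character(1)                                          # always return a character vector
  )
}

dedupe_parts(c("35 Valerie Avenue / 35 Valerie Avenue",
               "702 / 9 Paddock Street / 702 / 9 Paddock Street"))
# [1] "35 Valerie Avenue"      "702 / 9 Paddock Street"
```

Because it removes every repeated part, not only a repeated whole phrase, an address such as "2 / 2 / 10 Main Road" would also be shortened to "2 / 10 Main Road", so it is worth checking whether that can occur in the data. With the table from the question it could be applied as d[, address := dedupe_parts(address)].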

Andre Wildberg answered Mar 18 '26 05:03


Please check the code below:

library(dplyr)
library(tidyr)

d %>% separate_rows(address, sep = '\\/') %>% mutate(address = trimws(address)) %>%
  group_by(name, address) %>% slice_head(n = 1) %>% group_by(name) %>%
  mutate(address = paste(address, collapse = '/')) %>% slice_head(n = 1)

Created on 2023-01-27 with reprex v2.0.2

# A tibble: 2 × 2
# Groups:   name [2]
  name  address             
  <chr> <chr>               
1 bill  35 Valerie Avenue   
2 tom   702/9 Paddock Street

jkatam answered Mar 18 '26 03:03