
Detecting the differences between two string vectors

I've got a data_frame that looks like this.

df <- data_frame(name = c('john','bill','amy'),
           name.2 = c('johhn','ball','ammy')) 
df
# A tibble: 3 x 2
   name name.2
  <chr>  <chr>
1  john  johhn
2  bill   ball
3   amy   ammy

I want to add a column that shows the difference between the two name(.2) columns. Like this:

df %>% 
mutate(diff = c('h','a','m')) 
# A tibble: 3 x 3
   name name.2  diff
  <chr>  <chr> <chr>
1  john  johhn     h
2  bill   ball     a
3   amy   ammy     m

I'd prefer a solution that uses the tidyverse and stringr if possible, but I'll take whatever works.

elliot asked Apr 30 '26 12:04


1 Answer

Using base R, we can do something like:

diffc <- diag(attr(adist(df$name, df$name.2, counts = TRUE), "trafos"))
transform(df, diff = regmatches(name.2, regexpr("[^M]", diffc)))
  name name.2 diff
1 john  johhn    h
2 bill   ball    a
3  amy   ammy    m

Breakdown:

Compute the approximate string distance between df[,1] and df[,2], keeping the edit counts:

  d=adist(df$name,df$name.2, counts = TRUE)

Obtain the diagonal of the transformation matrix, i.e. the edit sequence for each row's own pair of names:

   e= diag(attr(d, "trafos"))
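
To see what the next step searches through, it can help to print these strings. The sketch below rebuilds the example vectors inline so that it runs on its own, and checks that each transformation string contains only the codes documented for adist: M (match), I (insertion), D (deletion) and S (substitution).

```r
x <- c("john", "bill", "amy")
y <- c("johhn", "ball", "ammy")

d <- adist(x, y, counts = TRUE)
e <- diag(attr(d, "trafos"))   # one edit sequence per (x[i], y[i]) pair

print(e)                       # strings made only of M, I, D and S
diag(d)                        # edit distance of each pair
all(grepl("^[MIDS]+$", e))     # TRUE when only the four codes appear
```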

Find the position of the first character that was deleted, substituted or inserted, i.e. not matched:

    f=regexpr("[^M]",e)

Extract the characters of df[,2] at those positions and store them in a new column:

     df$diff=regmatches(df$name.2,f)
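
The steps above can be wrapped into a small helper. This is only a sketch: first_diff is a hypothetical name, not a function from any package. It makes two choices that the original does not. It calls adist once per pair instead of building the full n-by-n matrix, and it returns NA for identical strings (regmatches would silently drop them and the column lengths would no longer match). As in the original, only the first differing position is reported and it is read from the second string, so a pure deletion yields whatever character sits at that position rather than the deleted one.

```r
# Hypothetical helper: first character of y at which y departs from x.
first_diff <- function(x, y) {
  # Edit sequence for each pair (x[i], y[i]); adist() on scalars gives a 1x1 matrix
  trafos <- mapply(function(a, b) {
    attr(adist(a, b, counts = TRUE), "trafos")[1, 1]
  }, x, y, USE.NAMES = FALSE)

  # Position of the first non-match code; -1 when the strings are identical
  pos <- as.integer(regexpr("[^M]", trafos))

  out <- substr(y, pos, pos)
  out[pos < 0] <- NA_character_
  out
}

first_diff(c("john", "bill", "amy"), c("johhn", "ball", "ammy"))
```

With dplyr loaded, the helper should drop into the pipeline from the question as df %>% mutate(diff = first_diff(name, name.2)).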

KU99 answered May 02 '26 04:05


