Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to combine similar strings showing most common characters

In a dataframe I have a list of strings that are similar to each other but separated by the difference of a %. I would like to combine these common strings into a single string that has the most common character at each location.

The dataframe looks like so:

pattern  Freq     score rank
DT%E 37568 1138.4242    1
%TGE 37666 1018.0000    2
D%GE 37641 1017.3243    3
DTG% 37665  965.7692    4
%VGNE 34234  684.6800    5
SVGN% 34281  634.8333    6
SV%NE 34248  634.2222    7
SVG%E 34265  623.0000    8
%LGNE 41098  595.6232    9
SL%NE 41086  595.4493   10
SLGN% 41200  564.3836   11
SPT%AYNE 35082  539.7231   12
SP%AAYNE 35094  531.7273   13
SPTA%YNE 35061  531.2273   14
SPTAA%NE 35225  518.0147   15
SPTAAYN% 35144  516.8235   16
%PTAAYNE 35111  516.3382   17
S%TAAYNE 35100  516.1765   18
SPTAAY%E 35130  509.1304   19
SLG%E 41467  450.7283   20

I am trying to add another column with the most likely string from the pattern column

pattern  Freq     score rank  true_string
DT%E 37568 1138.4242    1  DTGE
%TGE 37666 1018.0000    2  DTGE
D%GE 37641 1017.3243    3  DTGE
DTG% 37665  965.7692    4  DTGE
%VGNE 34234  684.6800    5  SVGNE
SVGN% 34281  634.8333    6  SVGNE
SV%NE 34248  634.2222    7  SVGNE
SVG%E 34265  623.0000    8  SVGNE
%LGNE 41098  595.6232    9  SLGNE
SL%NE 41086  595.4493   10  SLGNE
SLGN% 41200  564.3836   11  SLGNE
SPT%AYNE 35082  539.7231   12  SPTAAYNE
SP%AAYNE 35094  531.7273   13  SPTAAYNE
SPTA%YNE 35061  531.2273   14  SPTAAYNE
SPTAA%NE 35225  518.0147   15  SPTAAYNE
SPTAAYN% 35144  516.8235   16  SPTAAYNE
%PTAAYNE 35111  516.3382   17  SPTAAYNE
S%TAAYNE 35100  516.1765   18  SPTAAYNE
SPTAAY%E 35130  509.1304   19  SPTAAYNE
SLG%E 41467  450.7283   20  SLGNE
like image 740
El David Avatar asked May 13 '19 22:05

El David


2 Answers

This is a tricky but interesting question.

Here is something that should give you some ideas (and reproduces your expected output); please note however that this is somewhat of an empirical approach that makes the following assumptions:

  1. There are always >=2 patterns belonging to the same true_string; this is necessary for the (hierarchical) clustering method to work (see below). If you have <2 patterns defining a true_string this will not work, which makes sense because you'd have equal frequencies for two characters at the same position to occur.

  2. All patterns have the same lengths; i.e. we only consider single character substitutions but no insertions/deletions.

Approach

We make use of the library stringdist to calculate string similarities. stringdistmatrix offers various distance metrics (Levenshtein, Hamming, ..., see ?stringdist::stringdistmatrix for details). In this case, we use method = "qgram" because it results in a grouping that is consistent with your expected output (hence the earlier "empirical" warning). I don't know how well this will generalise for your real data, so it is important to keep in mind that you may have to play around with different methods to find a distance similarity metric that "fits" with your expectation.

After we've calculated the string distance matrix, we then cluster strings using hierarchical clustering; we add grp labels based on cutting the tree at a vertical distance of v = 2, and then use a custom get_consensus_string function to infer a consensus string per grp; as stated in the beginning, the function assumes that all strings within one grp have the same length, and for every position in the string selects the character with the largest occurrence frequency.

Code

First the custom get_consensus_string function

library(tidyverse)
get_consensus_string <- function(x) {
    map_dfc(x, str_split, "") %>%
        rowid_to_column("pos") %>%
        gather(k, v, -pos) %>%
        group_by(pos, v) %>%
        add_count() %>%
        group_by(pos) %>%
        filter(n == max(n)) %>%
        arrange(pos, desc(v)) %>%
        dplyr::slice(1) %>%
        pull(v) %>%
        paste0(collapse = "")
}

We can now add grp labels based on the hierarchical clustering results of the string similarity distance matrix from stringdist::stringdistmatrix; I empirically cut the tree here at a vertical distance of v = 2 (this is a parameter that may need tuning); once we have the grp labels we add the consensus string.

library(stringdist)
df %>%
    mutate(grp = cutree(hclust(stringdistmatrix(df$pattern, method = "qgram")), h = 2)) %>%
    group_by(grp) %>%
    mutate(true_string = get_consensus_string(pattern)) %>%
    ungroup()
## A tibble: 20 x 6
#   pattern   Freq score  rank   grp true_string
#   <fct>    <int> <dbl> <int> <int> <chr>
# 1 DT%E     37568 1138.     1     1 DTGE
# 2 %TGE     37666 1018      2     1 DTGE
# 3 D%GE     37641 1017.     3     1 DTGE
# 4 DTG%     37665  966.     4     1 DTGE
# 5 %VGNE    34234  685.     5     2 SVGNE
# 6 SVGN%    34281  635.     6     2 SVGNE
# 7 SV%NE    34248  634.     7     2 SVGNE
# 8 SVG%E    34265  623      8     2 SVGNE
# 9 %LGNE    41098  596.     9     3 SLGNE
#10 SL%NE    41086  595.    10     3 SLGNE
#11 SLGN%    41200  564.    11     3 SLGNE
#12 SPT%AYNE 35082  540.    12     4 SPTAAYNE
#13 SP%AAYNE 35094  532.    13     4 SPTAAYNE
#14 SPTA%YNE 35061  531.    14     4 SPTAAYNE
#15 SPTAA%NE 35225  518.    15     4 SPTAAYNE
#16 SPTAAYN% 35144  517.    16     4 SPTAAYNE
#17 %PTAAYNE 35111  516.    17     4 SPTAAYNE
#18 S%TAAYNE 35100  516.    18     4 SPTAAYNE
#19 SPTAAY%E 35130  509.    19     4 SPTAAYNE
#20 SLG%E    41467  451.    20     3 SLGNE

You can see that the final code is very clean, and reproduces your expected output.


Some further notes/comments

Two questions might be worth discussing: (1) How to choose the appropriate distance metric and (2) where to cut the tree.

Concerning the first question, an empirical approach would be to try different metrics and visualise the dendrogram after hierarchical clustering of the patterns.

For example, for method = "qgram" you would do

mat <- as.matrix(stringdistmatrix(df$pattern, method = "qgram"))
rownames(mat) <- df$pattern
colnames(mat) <- df$pattern
plot(hclust(as.dist(mat)))

enter image description here

Once you're satisfied with the clustering results, we can move on.

In regards to cutting the tree, a practical/pragmatic approach would be to inspect the dendrogram and find a suitable height at which we cut the tree (in our case, v = 2); alternatively if you know the number of unique true_strings you can specify the number of groups in cutree with k.

In more technical terms, the height of a dendrogram is associated with the distance between groups using complete linkage (i.e. measuring distance based on the most dissimilar pairs). Since the distance between groups in turn is based on the q-gram-distances between patterns it is possible to relate the height back to the q-gram-distance between two patterns, i.e. the absolute difference between N-gram vectors of both patterns.

like image 104
Maurits Evers Avatar answered Nov 17 '22 16:11

Maurits Evers


I checked the Maurits's answer, but when i added new row.

New Input

D%GT    12434   12421   22      DXGT
DX%T    31242   2221.2  21      DXGT

Used data

pattern Freq    score   rank        true_string
DT%E    37568   1138.4242   1       DTGE
D%GT    12434   12421   22      DXGT
DX%T    31242   2221.2  21      DXGT
%TGE    37666   1018    2       DTGE
D%GE    37641   1017.3243   3       DTGE
DTG%    37665   965.7692    4       DTGE
%VGNE   34234   684.68  5       SVGNE
SVGN%   34281   634.8333    6       SVGNE
SV%NE   34248   634.2222    7       SVGNE
SVG%E   34265   623 8       SVGNE
%LGNE   41098   595.6232    9       SLGNE
SL%NE   41086   595.4493    10      SLGNE
SLGN%   41200   564.3836    11      SLGNE
SPT%AYNE    35082   539.7231    12      SPTAAYNE
SP%AAYNE    35094   531.7273    13      SPTAAYNE
SPTA%YNE    35061   531.2273    14      SPTAAYNE
SPTAA%NE    35225   518.0147    15      SPTAAYNE
SPTAAYN%    35144   516.8235    16      SPTAAYNE
%PTAAYNE    35111   516.3382    17      SPTAAYNE
S%TAAYNE    35100   516.1765    18      SPTAAYNE
SPTAAY%E    35130   509.1304    19      SPTAAYNE
SLG%E   41467   450.7283    20      SLGNE

Maurits's answer

df %>%
    mutate(grp = cutree(hclust(stringdistmatrix(df$pattern, method = "qgram")), h = 2)) %>%
    group_by(grp) %>%
    mutate(true_string = get_consensus_string(pattern)) %>%
    ungroup()
> Result 
  pattern   Freq  score  rank   grp true_string
 1 DT%E     37568  1138.     1     1 DT%T       
 2 D%GT     12434 12421     22     1 DT%T       
 3 DX%T     31242  2221.    21     1 DT%T       
 4 %TGE     37666  1018      2     2 %TGE       
 5 D%GE     37641  1017.     3     2 %TGE       
 6 DTG%     37665   966.     4     1 DT%T       
 7 %VGNE    34234   685.     5     3 SVGNE      
 8 SVGN%    34281   635.     6     3 SVGNE      
 9 SV%NE    34248   634.     7     3 SVGNE      
10 SVG%E    34265   623      8     3 SVGNE      
11 %LGNE    41098   596.     9     4 SLGNE      
12 SL%NE    41086   595.    10     4 SLGNE      
13 SLGN%    41200   564.    11     4 SLGNE      
14 SPT%AYNE 35082   540.    12     5 SPTAAYNE   
15 SP%AAYNE 35094   532.    13     5 SPTAAYNE   
16 SPTA%YNE 35061   531.    14     5 SPTAAYNE   
17 SPTAA%NE 35225   518.    15     5 SPTAAYNE   
18 SPTAAYN% 35144   517.    16     5 SPTAAYNE   
19 %PTAAYNE 35111   516.    17     5 SPTAAYNE   
20 S%TAAYNE 35100   516.    18     5 SPTAAYNE   
21 SPTAAY%E 35130   509.    19     5 SPTAAYNE   
22 SLG%E    41467   451.    20     4 SLGNE   

From the above result it doesn't work.

My answer

library(dplyr)
library(data.table)

df <- fread(data)

string_pred <- function(x){
  
  x = x %>% mutate(CL=nchar(pattern)) 

  x_1 = x%>% select(pattern,CL)
  Chr.length = unique(x_1$CL)
  final_result = NULL
  for ( len in 1:length(Chr.length)){ 
    x_1_tmp = x %>% filter(CL==Chr.length[len])
    
    RESULT = NULL
    for(i in 1:Chr.length[len]){

      TMP = substr(x_1_tmp$pattern,i,i)
      TMP_GUESS = unique(TMP[!grepl("%",TMP)])
      if(length(TMP_GUESS)==1){
        TMP[grepl("%",TMP)] <- TMP_GUESS  
      } else {
        TMP= TMP
      }
      NAME = sprintf('P%s',i)
      
      RESULT = cbind(RESULT, NAME=TMP) %>% as.data.table()
      names(RESULT)[i] = eval(parse(text='NAME'))
    }
    material = RESULT %>% rowwise() %>% .[apply(.,1,function(x){'%' %in% x}) ,]
    if (nrow(material)==0){
      x_1_tmp =x_1_tmp %>%  mutate( pred = apply(RESULT,1,function(x)paste(as.character(x),collapse = ''))) %>% as.data.table()
    } else {
      mat.loc = RESULT %>% rowwise() %>%apply(.,1,function(x){'%' %in% x}) %>% which(unlist(.)==TRUE)
      
      for (i in 1:nrow(material)){
        ori.loc = mat.loc[i]
        loc = names(material[i,])[material[i,]=='%']
        tmp = material[i,] %>% dplyr::select(-loc)
        RESULT[ori.loc,] = RESULT %>% rowwise()  %>% inner_join(., tmp) %>% .[apply(.,1,function(x){!('%' %in% x)}) ,] %>% unique()
        
      }
      x_1_tmp = x_1_tmp %>%mutate( pred = apply(RESULT,1,function(x)paste(as.character(x),collapse = ''))) %>% as.data.table()
    }
    final_result = rbind(final_result, x_1_tmp)
  }
  return(final_result)
}

Result from my Asnwer

> string_pred(df)

     pattern  Freq      score rank CL     pred
 1:     DT%E 37568  1138.4242    1  4     DTGE
 2:     D%GT 12434 12421.0000   22  4     DXGT
 3:     DX%T 31242  2221.2000   21  4     DXGT
 4:     %TGE 37666  1018.0000    2  4     DTGE
 5:     D%GE 37641  1017.3243    3  4     DTGE
 6:     DTG% 37665   965.7692    4  4     DTGE
 7:    %VGNE 34234   684.6800    5  5    SVGNE
 8:    SVGN% 34281   634.8333    6  5    SVGNE
 9:    SV%NE 34248   634.2222    7  5    SVGNE
10:    SVG%E 34265   623.0000    8  5    SVGNE
11:    %LGNE 41098   595.6232    9  5    SLGNE
12:    SL%NE 41086   595.4493   10  5    SLGNE
13:    SLGN% 41200   564.3836   11  5    SLGNE
14:    SLG%E 41467   450.7283   20  5    SLGNE
15: SPT%AYNE 35082   539.7231   12  8 SPTAAYNE
16: SP%AAYNE 35094   531.7273   13  8 SPTAAYNE
17: SPTA%YNE 35061   531.2273   14  8 SPTAAYNE
18: SPTAA%NE 35225   518.0147   15  8 SPTAAYNE
19: SPTAAYN% 35144   516.8235   16  8 SPTAAYNE
20: %PTAAYNE 35111   516.3382   17  8 SPTAAYNE
21: S%TAAYNE 35100   516.1765   18  8 SPTAAYNE
22: SPTAAY%E 35130   509.1304   19  8 SPTAAYNE

Approach

  1. separate by each pattern's character length
  pattern  Freq      score rank CL
1    DT%E 37568  1138.4242    1  4
2    D%GT 12434 12421.0000   22  4
3    DX%T 31242  2221.2000   21  4
4    %TGE 37666  1018.0000    2  4
5    D%GE 37641  1017.3243    3  4
6    DTG% 37665   965.7692    4  4
  1. Inspect each charcter one by one.
 TMP = substr(x_1_tmp$pattern,i,i)
[1] "D" "D" "D" "%" "D" "D"
  1. If unique(pattern[i] except % ) == 1 --> we allocate % as unique(pattern[i] except % )
   P1 P2 P3 P4
1:  D  T  G  E
2:  D  %  G  T
3:  D  X  G  T
4:  D  T  G  E
5:  D  %  G  E
6:  D  T  G  %
  1. unique(pattern[i] except % ) > 1 we check the other row in character length group. And we merged character (except % column) into other character.
RESULT[ori.loc,] = RESULT %>% rowwise()  %>% 
                      inner_join(., tmp) %>%
                       .[apply(.,1,function(x){!('%' %in% x)}) ,] %>% unique()
>print
Joining, by = c("P1", "P3", "P4")
Source: local data frame [1 x 4]
Groups: <by row>

# A tibble: 1 x 4
  P1    P2    P3    P4   
  <chr> <chr> <chr> <chr>
1 D     X     G     T    
  1. Finally we can predict the what % is
   pattern  Freq      score rank CL pred
1:    DT%E 37568  1138.4242    1  4 DTGE
2:    D%GT 12434 12421.0000   22  4 DXGT
3:    DX%T 31242  2221.2000   21  4 DXGT
4:    %TGE 37666  1018.0000    2  4 DTGE
5:    D%GE 37641  1017.3243    3  4 DTGE
6:    DTG% 37665   965.7692    4  4 DTGE

My answer is not looks fancy, but it is working..

I recommend you just following the code one by one

like image 39
Steve Lee Avatar answered Nov 17 '22 15:11

Steve Lee