In a dataframe I have a list of strings that are similar to each other but separated by the difference of a %. I would like to combine these common strings into a single string that has the most common character at each location.
The dataframe looks like so:
pattern Freq score rank
DT%E 37568 1138.4242 1
%TGE 37666 1018.0000 2
D%GE 37641 1017.3243 3
DTG% 37665 965.7692 4
%VGNE 34234 684.6800 5
SVGN% 34281 634.8333 6
SV%NE 34248 634.2222 7
SVG%E 34265 623.0000 8
%LGNE 41098 595.6232 9
SL%NE 41086 595.4493 10
SLGN% 41200 564.3836 11
SPT%AYNE 35082 539.7231 12
SP%AAYNE 35094 531.7273 13
SPTA%YNE 35061 531.2273 14
SPTAA%NE 35225 518.0147 15
SPTAAYN% 35144 516.8235 16
%PTAAYNE 35111 516.3382 17
S%TAAYNE 35100 516.1765 18
SPTAAY%E 35130 509.1304 19
SLG%E 41467 450.7283 20
I am trying to add another column with the most likely string from the pattern column
pattern Freq score rank true_string
DT%E 37568 1138.4242 1 DTGE
%TGE 37666 1018.0000 2 DTGE
D%GE 37641 1017.3243 3 DTGE
DTG% 37665 965.7692 4 DTGE
%VGNE 34234 684.6800 5 SVGNE
SVGN% 34281 634.8333 6 SVGNE
SV%NE 34248 634.2222 7 SVGNE
SVG%E 34265 623.0000 8 SVGNE
%LGNE 41098 595.6232 9 SLGNE
SL%NE 41086 595.4493 10 SLGNE
SLGN% 41200 564.3836 11 SLGNE
SPT%AYNE 35082 539.7231 12 SPTAAYNE
SP%AAYNE 35094 531.7273 13 SPTAAYNE
SPTA%YNE 35061 531.2273 14 SPTAAYNE
SPTAA%NE 35225 518.0147 15 SPTAAYNE
SPTAAYN% 35144 516.8235 16 SPTAAYNE
%PTAAYNE 35111 516.3382 17 SPTAAYNE
S%TAAYNE 35100 516.1765 18 SPTAAYNE
SPTAAY%E 35130 509.1304 19 SPTAAYNE
SLG%E 41467 450.7283 20 SLGNE
This is a tricky but interesting question.
Here is something that should give you some ideas (and reproduces your expected output); please note however that this is somewhat of an empirical approach that makes the following assumptions:
There are always >=2
patterns belonging to the same true_string
; this is necessary for the (hierarchical) clustering method to work (see below). If you have <2
patterns defining a true_string
this will not work, which makes sense because you'd have equal frequencies for two characters at the same position to occur.
All pattern
s have the same lengths; i.e. we only consider single character substitutions but no insertions/deletions.
We make use of the library stringdist
to calculate string similarities. stringdistmatrix
offers various distance metrics (Levenshtein, Hamming, ..., see ?stringdist::stringdistmatrix
for details). In this case, we use method = "qgram"
because it results in a grouping that is consistent with your expected output (hence the earlier "empirical" warning). I don't know how well this will generalise for your real data, so it is important to keep in mind that you may have to play around with different method
s to find a distance similarity metric that "fits" with your expectation.
After we've calculated the string distance matrix, we then cluster strings using hierarchical clustering; we add grp
labels based on cutting the tree at a vertical distance of v = 2
, and then use a custom get_consensus_string
function to infer a consensus string per grp
; as stated in the beginning, the function assumes that all strings within one grp
have the same length, and for every position in the string selects the character with the largest occurrence frequency.
First the custom get_consensus_string
function
library(tidyverse)
get_consensus_string <- function(x) {
map_dfc(x, str_split, "") %>%
rowid_to_column("pos") %>%
gather(k, v, -pos) %>%
group_by(pos, v) %>%
add_count() %>%
group_by(pos) %>%
filter(n == max(n)) %>%
arrange(pos, desc(v)) %>%
dplyr::slice(1) %>%
pull(v) %>%
paste0(collapse = "")
}
We can now add grp
labels based on the hierarchical clustering results of the string similarity distance matrix from stringdist::stringdistmatrix
; I empirically cut the tree here at a vertical distance of v = 2
(this is a parameter that may need tuning); once we have the grp
labels we add the consensus string.
library(stringdist)
df %>%
mutate(grp = cutree(hclust(stringdistmatrix(df$pattern, method = "qgram")), h = 2)) %>%
group_by(grp) %>%
mutate(true_string = get_consensus_string(pattern)) %>%
ungroup()
## A tibble: 20 x 6
# pattern Freq score rank grp true_string
# <fct> <int> <dbl> <int> <int> <chr>
# 1 DT%E 37568 1138. 1 1 DTGE
# 2 %TGE 37666 1018 2 1 DTGE
# 3 D%GE 37641 1017. 3 1 DTGE
# 4 DTG% 37665 966. 4 1 DTGE
# 5 %VGNE 34234 685. 5 2 SVGNE
# 6 SVGN% 34281 635. 6 2 SVGNE
# 7 SV%NE 34248 634. 7 2 SVGNE
# 8 SVG%E 34265 623 8 2 SVGNE
# 9 %LGNE 41098 596. 9 3 SLGNE
#10 SL%NE 41086 595. 10 3 SLGNE
#11 SLGN% 41200 564. 11 3 SLGNE
#12 SPT%AYNE 35082 540. 12 4 SPTAAYNE
#13 SP%AAYNE 35094 532. 13 4 SPTAAYNE
#14 SPTA%YNE 35061 531. 14 4 SPTAAYNE
#15 SPTAA%NE 35225 518. 15 4 SPTAAYNE
#16 SPTAAYN% 35144 517. 16 4 SPTAAYNE
#17 %PTAAYNE 35111 516. 17 4 SPTAAYNE
#18 S%TAAYNE 35100 516. 18 4 SPTAAYNE
#19 SPTAAY%E 35130 509. 19 4 SPTAAYNE
#20 SLG%E 41467 451. 20 3 SLGNE
You can see that the final code is very clean, and reproduces your expected output.
Two questions might be worth discussing: (1) How to choose the appropriate distance metric and (2) where to cut the tree.
Concerning the first question, an empirical approach would be to try different metrics and visualise the dendrogram after hierarchical clustering of the pattern
s.
For example, for method = "qgram"
you would do
mat <- as.matrix(stringdistmatrix(df$pattern, method = "qgram"))
rownames(mat) <- df$pattern
colnames(mat) <- df$pattern
plot(hclust(as.dist(mat)))
Once you're satisfied with the clustering results, we can move on.
In regards to cutting the tree, a practical/pragmatic approach would be to inspect the dendrogram and find a suitable height at which we cut the tree (in our case, v = 2
); alternatively if you know the number of unique true_string
s you can specify the number of groups in cutree
with k
.
In more technical terms, the height of a dendrogram is associated with the distance between groups using complete linkage (i.e. measuring distance based on the most dissimilar pairs). Since the distance between groups in turn is based on the q-gram-distances between pattern
s it is possible to relate the height back to the q-gram-distance between two pattern
s, i.e. the absolute difference between N-gram vectors of both pattern
s.
I checked the Maurits's answer, but when i added new row.
D%GT 12434 12421 22 DXGT
DX%T 31242 2221.2 21 DXGT
pattern Freq score rank true_string
DT%E 37568 1138.4242 1 DTGE
D%GT 12434 12421 22 DXGT
DX%T 31242 2221.2 21 DXGT
%TGE 37666 1018 2 DTGE
D%GE 37641 1017.3243 3 DTGE
DTG% 37665 965.7692 4 DTGE
%VGNE 34234 684.68 5 SVGNE
SVGN% 34281 634.8333 6 SVGNE
SV%NE 34248 634.2222 7 SVGNE
SVG%E 34265 623 8 SVGNE
%LGNE 41098 595.6232 9 SLGNE
SL%NE 41086 595.4493 10 SLGNE
SLGN% 41200 564.3836 11 SLGNE
SPT%AYNE 35082 539.7231 12 SPTAAYNE
SP%AAYNE 35094 531.7273 13 SPTAAYNE
SPTA%YNE 35061 531.2273 14 SPTAAYNE
SPTAA%NE 35225 518.0147 15 SPTAAYNE
SPTAAYN% 35144 516.8235 16 SPTAAYNE
%PTAAYNE 35111 516.3382 17 SPTAAYNE
S%TAAYNE 35100 516.1765 18 SPTAAYNE
SPTAAY%E 35130 509.1304 19 SPTAAYNE
SLG%E 41467 450.7283 20 SLGNE
df %>%
mutate(grp = cutree(hclust(stringdistmatrix(df$pattern, method = "qgram")), h = 2)) %>%
group_by(grp) %>%
mutate(true_string = get_consensus_string(pattern)) %>%
ungroup()
> Result
pattern Freq score rank grp true_string
1 DT%E 37568 1138. 1 1 DT%T
2 D%GT 12434 12421 22 1 DT%T
3 DX%T 31242 2221. 21 1 DT%T
4 %TGE 37666 1018 2 2 %TGE
5 D%GE 37641 1017. 3 2 %TGE
6 DTG% 37665 966. 4 1 DT%T
7 %VGNE 34234 685. 5 3 SVGNE
8 SVGN% 34281 635. 6 3 SVGNE
9 SV%NE 34248 634. 7 3 SVGNE
10 SVG%E 34265 623 8 3 SVGNE
11 %LGNE 41098 596. 9 4 SLGNE
12 SL%NE 41086 595. 10 4 SLGNE
13 SLGN% 41200 564. 11 4 SLGNE
14 SPT%AYNE 35082 540. 12 5 SPTAAYNE
15 SP%AAYNE 35094 532. 13 5 SPTAAYNE
16 SPTA%YNE 35061 531. 14 5 SPTAAYNE
17 SPTAA%NE 35225 518. 15 5 SPTAAYNE
18 SPTAAYN% 35144 517. 16 5 SPTAAYNE
19 %PTAAYNE 35111 516. 17 5 SPTAAYNE
20 S%TAAYNE 35100 516. 18 5 SPTAAYNE
21 SPTAAY%E 35130 509. 19 5 SPTAAYNE
22 SLG%E 41467 451. 20 4 SLGNE
From the above result it doesn't work.
library(dplyr)
library(data.table)
df <- fread(data)
string_pred <- function(x){
x = x %>% mutate(CL=nchar(pattern))
x_1 = x%>% select(pattern,CL)
Chr.length = unique(x_1$CL)
final_result = NULL
for ( len in 1:length(Chr.length)){
x_1_tmp = x %>% filter(CL==Chr.length[len])
RESULT = NULL
for(i in 1:Chr.length[len]){
TMP = substr(x_1_tmp$pattern,i,i)
TMP_GUESS = unique(TMP[!grepl("%",TMP)])
if(length(TMP_GUESS)==1){
TMP[grepl("%",TMP)] <- TMP_GUESS
} else {
TMP= TMP
}
NAME = sprintf('P%s',i)
RESULT = cbind(RESULT, NAME=TMP) %>% as.data.table()
names(RESULT)[i] = eval(parse(text='NAME'))
}
material = RESULT %>% rowwise() %>% .[apply(.,1,function(x){'%' %in% x}) ,]
if (nrow(material)==0){
x_1_tmp =x_1_tmp %>% mutate( pred = apply(RESULT,1,function(x)paste(as.character(x),collapse = ''))) %>% as.data.table()
} else {
mat.loc = RESULT %>% rowwise() %>%apply(.,1,function(x){'%' %in% x}) %>% which(unlist(.)==TRUE)
for (i in 1:nrow(material)){
ori.loc = mat.loc[i]
loc = names(material[i,])[material[i,]=='%']
tmp = material[i,] %>% dplyr::select(-loc)
RESULT[ori.loc,] = RESULT %>% rowwise() %>% inner_join(., tmp) %>% .[apply(.,1,function(x){!('%' %in% x)}) ,] %>% unique()
}
x_1_tmp = x_1_tmp %>%mutate( pred = apply(RESULT,1,function(x)paste(as.character(x),collapse = ''))) %>% as.data.table()
}
final_result = rbind(final_result, x_1_tmp)
}
return(final_result)
}
> string_pred(df)
pattern Freq score rank CL pred
1: DT%E 37568 1138.4242 1 4 DTGE
2: D%GT 12434 12421.0000 22 4 DXGT
3: DX%T 31242 2221.2000 21 4 DXGT
4: %TGE 37666 1018.0000 2 4 DTGE
5: D%GE 37641 1017.3243 3 4 DTGE
6: DTG% 37665 965.7692 4 4 DTGE
7: %VGNE 34234 684.6800 5 5 SVGNE
8: SVGN% 34281 634.8333 6 5 SVGNE
9: SV%NE 34248 634.2222 7 5 SVGNE
10: SVG%E 34265 623.0000 8 5 SVGNE
11: %LGNE 41098 595.6232 9 5 SLGNE
12: SL%NE 41086 595.4493 10 5 SLGNE
13: SLGN% 41200 564.3836 11 5 SLGNE
14: SLG%E 41467 450.7283 20 5 SLGNE
15: SPT%AYNE 35082 539.7231 12 8 SPTAAYNE
16: SP%AAYNE 35094 531.7273 13 8 SPTAAYNE
17: SPTA%YNE 35061 531.2273 14 8 SPTAAYNE
18: SPTAA%NE 35225 518.0147 15 8 SPTAAYNE
19: SPTAAYN% 35144 516.8235 16 8 SPTAAYNE
20: %PTAAYNE 35111 516.3382 17 8 SPTAAYNE
21: S%TAAYNE 35100 516.1765 18 8 SPTAAYNE
22: SPTAAY%E 35130 509.1304 19 8 SPTAAYNE
pattern Freq score rank CL
1 DT%E 37568 1138.4242 1 4
2 D%GT 12434 12421.0000 22 4
3 DX%T 31242 2221.2000 21 4
4 %TGE 37666 1018.0000 2 4
5 D%GE 37641 1017.3243 3 4
6 DTG% 37665 965.7692 4 4
TMP = substr(x_1_tmp$pattern,i,i)
[1] "D" "D" "D" "%" "D" "D"
unique(pattern[i] except % ) == 1
--> we allocate % as unique(pattern[i] except % )
P1 P2 P3 P4
1: D T G E
2: D % G T
3: D X G T
4: D T G E
5: D % G E
6: D T G %
unique(pattern[i] except % ) > 1
we check the other row in character length group. And we merged character (except % column) into other character.RESULT[ori.loc,] = RESULT %>% rowwise() %>%
inner_join(., tmp) %>%
.[apply(.,1,function(x){!('%' %in% x)}) ,] %>% unique()
>print
Joining, by = c("P1", "P3", "P4")
Source: local data frame [1 x 4]
Groups: <by row>
# A tibble: 1 x 4
P1 P2 P3 P4
<chr> <chr> <chr> <chr>
1 D X G T
%
is pattern Freq score rank CL pred
1: DT%E 37568 1138.4242 1 4 DTGE
2: D%GT 12434 12421.0000 22 4 DXGT
3: DX%T 31242 2221.2000 21 4 DXGT
4: %TGE 37666 1018.0000 2 4 DTGE
5: D%GE 37641 1017.3243 3 4 DTGE
6: DTG% 37665 965.7692 4 4 DTGE
My answer is not looks fancy, but it is working..
I recommend you just following the code one by one
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With