How to rearrange the order of matches between two data frames

I have been working on this since last night and cannot figure out how to do it.

What I want is to match the strings in df1 against the strings in df2 and pull out the ones that match.

This is what I do:

# a function that labels each string with the ID (row number) of the row it came from
normalize <- function(x, delim) {
  # strip the parentheses before splitting
  x <- gsub(")", "", x, fixed=TRUE)
  x <- gsub("(", "", x, fixed=TRUE)
  # each row contributes (number of delimiters + 1) strings
  idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
  names <- unlist(strsplit(as.character(x), delim))
  return(setNames(idx, names))
}
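
To see what normalize returns, here is a small check on made-up input (the vector `x_demo` is hypothetical, not the real df2). The function is repeated only so that the snippet runs on its own. Each name in the result is one string, and its value is the row that string came from.

```r
# Same function as in the question, repeated so this snippet is self-contained
normalize <- function(x, delim) {
  x <- gsub(")", "", x, fixed = TRUE)
  x <- gsub("(", "", x, fixed = TRUE)
  idx <- rep(seq_len(length(x)),
             times = nchar(gsub(sprintf("[^%s]", delim), "", as.character(x))) + 1)
  names <- unlist(strsplit(as.character(x), delim))
  return(setNames(idx, names))
}

# Hypothetical input: three rows holding 2, 1 and 3 strings
x_demo <- c("(A,B)", "C", "D,E,F")

lookup_demo <- normalize(x_demo, ",")
print(lookup_demo)
# A B C D E F
# 1 1 2 3 3 3
```

Note that the counting relies on every row being non-empty; an empty string or NA in a row is not handled by this version.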

# build the lookup (string -> row number) from the first column of df2
lookup <- normalize(df2[,1], ",")

# a function to match them and give the IDs
process <- function(s) {
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  return(paste(s[found], pos, sep="-"))
  #change the last line to "return(as.character(pos))" to get only the result as in the comment
}

Then I get the results like this:

# name the input so the list elements are named after the df1 columns
res <- lapply(setNames(nm = colnames(df1)), function(x) process(normalize(df1[,x], ";")))

For each match, this gives me the row number of the string in df1 and the row number of the matching string in df2, so the output looks like this:

> res
$s1
[1] "3-4" "4-1" "5-4"

$s2
[1] "2-4"  "3-15" "7-16"
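
For anyone who wants to run the steps above end to end, here is a minimal sketch on made-up data. The contents of `df1` and `df2` below are assumptions for illustration (the real data is not shown here), so the numbers differ from the output above; only the column names s1, s2 and IDs. follow the post.

```r
# Functions as in the question, repeated so the snippet runs on its own
normalize <- function(x, delim) {
  x <- gsub(")", "", x, fixed = TRUE)
  x <- gsub("(", "", x, fixed = TRUE)
  idx <- rep(seq_len(length(x)),
             times = nchar(gsub(sprintf("[^%s]", delim), "", as.character(x))) + 1)
  names <- unlist(strsplit(as.character(x), delim))
  return(setNames(idx, names))
}

# Hypothetical toy data (not the data from the question)
df1 <- data.frame(s1 = c("P3;P9", "P4"),
                  s2 = c("P8", "P1;P5"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(IDs. = c("(P1,P2)", "P3", "(P4,P5)"),
                  stringsAsFactors = FALSE)

# string -> row number in df2
lookup <- normalize(df2[, 1], ",")

process <- function(s) {
  lookup_try <- lookup[names(s)]          # NA where a df1 string is not in df2
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  return(paste(s[found], pos, sep = "-")) # "<df1 row>-<df2 row>"
}

# Naming the input vector is what gives the list elements their names
res <- lapply(setNames(nm = colnames(df1)),
              function(x) process(normalize(df1[, x], ";")))
print(res)
# $s1
# [1] "1-2" "2-3"
#
# $s2
# [1] "2-1" "2-3"
```

Here "1-2" means that a string in row 1 of df1 matched a string in row 2 of df2.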

What I want instead is a table with the following columns. The first column, IDs, is the row number in df2 that matched strings in df1. The second column, No, is the number of times it matched. The third column, ID-col-n, is the row number of each matching string in df1 together with its column name. The fourth column is the string from the first column of df1 that matched, the fifth column is the string from the second column of df1 that matched, and so on.

asked Feb 29 '16 18:02

nik

1 Answer

In this case I find it easier to switch the data to the long format before merging it with the lookup table.

You could try:

library(tidyr)
library(dplyr)
df1_tmp <- df1
df2_tmp <- df2
#add numerical id to df1_tmp to keep row information
df1_tmp$id <- seq_along(df1_tmp[,1])

#switch to long format and unnest rows with several strings
df1_tmp <- gather(df1_tmp,key="s_val",value="query_string",-id)
df1_tmp <- df1_tmp %>% 
        mutate(query_string = strsplit(as.character(query_string), ";")) %>% 
        unnest(query_string)


df2_tmp$IDs. <- gsub("[()]", "", df2_tmp$IDs.)

#add numerical id to df2_tmp to keep row information
df2_tmp$id <- seq_along(df2_tmp$IDs.)

#unnest rows with several strings
df2_tmp <- df2_tmp %>% 
        mutate(IDs. = strsplit(as.character(IDs.), ",")) %>% 
        unnest(IDs.)

res <- merge(df1_tmp,df2_tmp,by.x="query_string",by.y="IDs.")

res$ID_col_n <- paste0(res$id.x,res$s_val)
res$total_id <- 1:nrow(res)
res <- spread(res,s_val,value=query_string,fill=NA)
res

#summarize to get required output
#(summarise_each() and funs() are deprecated in current dplyr; across() is the replacement)
res <- res %>% group_by(id.y) %>%
        mutate(No=n())  %>% group_by(id.y,No) %>%
        summarise_each(funs(paste(.[!is.na(.)],collapse=","))) %>% 
        select(-id.x,-total_id)

colnames(res)[colnames(res)=="id.y"]<-"IDs"

res$df1_colMatch_counts <- rowSums(res[,-(1:3)]!="")
df2_counts <- df2_tmp %>% group_by(id) %>% summarize(df2_string_counts=n())
res <- merge(res,df2_counts,by.x="IDs",by.y="id")
res

  IDs No    ID_col_n            s1     s2 df1_colMatch_counts df2_string_counts
1   1  1         4s1        P41182                          1                 2
2   2  1         4s1        P41182                          1                 2
3   3  1         4s1        P41182                          1                 2
4   4  3 2s2,3s1,5s1 Q9Y6Q9,Q09472 Q92831                   2                 4
5  15  1         3s2               P54612                   1                 5
6  16  1         7s2               O15143                   1                 7
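
The same long-format-then-merge idea can also be sketched in base R, without tidyr or dplyr. This is only an illustration under stated assumptions: the contents of `df1` and `df2` are made up, and the helper names `df1_long`, `df2_long`, `matches`, `by_df2` and `summary_tab` are hypothetical. It reproduces the IDs, No, ID_col_n, s1 and s2 columns, not the two count columns.

```r
# Hypothetical toy data (not the data from the question)
df1 <- data.frame(s1 = c("P3;P9", "P4"),
                  s2 = c("P8", "P1;P5"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(IDs. = c("(P1,P2)", "P3", "(P4,P5)"),
                  stringsAsFactors = FALSE)

# df1 in long format: one row per (row id, column name, single string)
df1_long <- do.call(rbind, lapply(colnames(df1), function(col) {
  parts <- strsplit(as.character(df1[[col]]), ";", fixed = TRUE)
  data.frame(id = rep(seq_along(parts), lengths(parts)),
             s_val = col,
             query_string = unlist(parts),
             stringsAsFactors = FALSE)
}))

# df2 in long format: strip parentheses, then one row per single string
parts2 <- strsplit(gsub("[()]", "", as.character(df2$IDs.)), ",", fixed = TRUE)
df2_long <- data.frame(id = rep(seq_along(parts2), lengths(parts2)),
                       IDs. = unlist(parts2),
                       stringsAsFactors = FALSE)

# Keep only strings present in both; id.x is the df1 row, id.y the df2 row
matches <- merge(df1_long, df2_long, by.x = "query_string", by.y = "IDs.")
matches$ID_col_n <- paste0(matches$id.x, matches$s_val)

# Collapse to one row per matched df2 row
by_df2 <- split(matches, matches$id.y)
collapse_col <- function(col) {
  unname(vapply(by_df2,
                function(d) paste(d$query_string[d$s_val == col], collapse = ","),
                character(1)))
}
summary_tab <- data.frame(
  IDs = as.integer(names(by_df2)),
  No = unname(vapply(by_df2, nrow, integer(1))),
  ID_col_n = unname(vapply(by_df2,
                           function(d) paste(d$ID_col_n, collapse = ","),
                           character(1))),
  s1 = collapse_col("s1"),
  s2 = collapse_col("s2"),
  stringsAsFactors = FALSE
)
print(summary_tab)
```

On this toy input, row 3 of df2 is matched twice (by row 2 of s1 and row 2 of s2), so its No is 2 and its ID_col_n is "2s1,2s2".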

answered Oct 12 '22 12:10

NicE

Tags: list, dataframe, r
