Merging through fuzzy matching of variables in R

Tags:

I have two dataframes (x & y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), I have about 60% of values which are not aligning, though I can eyeball the data and see they should. Is there a way to reduce the level of non-match somehow so that manually editing because at least feasible? The dataframes are have about 700K observations.

R would be best. I know a little bit of python, and some basic unix tools. P.S. I read up on agrep(), but don't understand how that can work on actual datasets, especially when the match is over more than one variable.

update (data for posted bounty):

Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename or c) both.

you can source the file test_sites.R which is posted as a gist.

Ideally the answer would end with

merge(sites_a, sites_b, by = **magic**)

582

asked Aug 19 '11 10:08

user702432

2 Answers

The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But this is a suggestion... It records matches in a separate list (if there are multiple equally good matches, then these are recorded as well). Let's say that your data.frame is called df:

l <- vector('list',nrow(df))
matches <- list(mother = l,father = l)
for(i in 1:nrow(df)){
  father_id <- with(df,which(student_name[i] == father_name))
  if(length(father_id) == 1){
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## try to find the total                                                                                                                                 
    for(m in 10:1){ ## m is the maximum distance                                                                                                             
      father_id <- with(df,agrep(student_name[i],father_name,max.dist = m))
      if(length(father_id) == 1 || m == 1){
        ## if we find a unique match or if we are in our last round, then stop                                                                               
        matches[['father']][[i]] <- father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway                                                                              
        matches[['father']][[i]] <- old_father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop                                                                
        break
      }
    }
  }
}

The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.

170

answered Sep 25 '22 06:09

nullglob

This takes a list of common column names, matches based on agrep of all those columns combined, and then if all.x or all.y equals TRUE it appends non-matching records filling in missing columns with NA. Unlike merge, the column names to match on need to be the same in each data frame. The challenge would seem to be setting the agrep options correctly to avoid spurious matches.

  agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE, 
    ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) {

    df1$index <- apply(df1[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
    df2$index <- apply(df2[,by, drop = FALSE], 1, paste, sep = "", collapse = "")

    matches <- lapply(seq_along(df1$index), function(i, ...) {
      agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value,
            max.distance = max.distance, useBytes = useBytes)
    })

    df1_match <- rep(1:nrow(df1), sapply(matches, length))
    df2_match <- unlist(matches)

    df1_hits <- df1[df1_match,]
    df2_hits <- df2[df2_match,]

    df1_miss <- df1[setdiff(seq_along(df1$index), df1_match),]
    df2_miss <- df2[setdiff(seq_along(df2$index), df2_match),]

    remove_cols <- colnames(df2_hits) %in% colnames(df1_hits)

    df_out <- cbind(df1_hits, df2_hits[,!remove_cols])

    if(all.x) {
      missing_cols <- setdiff(colnames(df_out), colnames(df1_miss))
      df1_miss[missing_cols] <- NA
      df_out <- rbind(df_out, df1_miss)
    }
    if(all.x) {
      missing_cols <- setdiff(colnames(df_out), colnames(df2_miss))
      df2_miss[missing_cols] <- NA
      df_out <- rbind(df_out, df2_miss)
    }
    df_out[,setdiff(colnames(df_out), "index")]
}

answered Sep 26 '22 06:09

SchaunW

Related questions
                            
                                Configuration error when installing R on Linux [closed]
                            
                                *Almost* Perfect C Shell Piping
                            
                                Can inode and crtime be used as a unique file identifier?
                            
                                gcc - /usr/bin/ld error: cannot find <library> in /usr/local/lib though ldconfig list it, and path added to ld.so.conf
                            
                                Maximum size of string can be printed using %s?
                            
                                How to know if one shared library depends on another shared library or not?
                            
                                Answer to password shell prompt in golang
                            
                                Find directory for "application data" on linux and macintosh
                            
                                Python command line: editing mistake on previous line?
                            
                                How is each process pinned to a specific core by scheduler (Linux)
                            
                                How to read low level mouse click position in linux .
                            
                                Device node at /dev/tty* not getting created for uart serial driver
                            
                                Generate LLVM C++ API code as backend
                            
                                How to build Qt 5 statically
                            
                                Linux /opt/* folder and permissions [closed]
                            
                                C# execute a terminal command in linux
                            
                                undefined reference to '_asan_init_v4' when compiling
                            
                                Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory
                            
                                debugging templates with GDB
                            
                                using "touch" to create directories? [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Merging through fuzzy matching of variables in R

Tags:

linux

unix

r

user702432

People also ask

2 Answers

nullglob

SchaunW

Recent Activity

Donate For Us