First of all, I am sorry if my formatting is bad; this is my first time posting (I am also new to programming and R).
I am trying to merge two data frames together on string variables. I am merging university names, which might not match up perfectly, so I was hoping to merge using a fuzzy or approximate string matching function. I was happy when I found the ‘fuzzyjoin’ package.
From the CRAN documentation: stringdist_join: Join two tables based on fuzzy string matching of their columns
stringdist_join(x, y, by = NULL, max_dist = 2,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
  mode = "inner", ignore_case = FALSE, distance_col = NULL, ...)
My code:
stringdist_left_join(new, institutions, by = c("tm_9_undergradu" = "Institution.Name"))
Error:
Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], :
NAs are not allowed in subscripted assignments
I know that there are some NAs in these columns, but I am not sure how to remove them, because I need those rows to stay in the data as well. I know that in other join and merge functions the NAs are simply ignored. Does anyone know a way to get around this error for this package, or another way to do an approximate join on strings? Thank you for your help.
This answer worked for me and is from GitHub
Step 1: figure out which data frame has the NAs
`which(is.na(df1))
which(is.na(df2))`
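On a whole data frame, `which(is.na(df))` returns positions in the flattened logical matrix, which can be hard to read. A minimal base R sketch of a per-column check follows; `df1`, `name` and `score` here are made-up stand-ins, not the tables from the question.

```r
# Hypothetical data standing in for one of the tables to be joined.
df1 <- data.frame(
  name  = c("Harvard University", NA, "Stanford University"),
  score = c(3.9, NA, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column instead of listing raw positions.
na_per_column <- colSums(is.na(df1))
print(na_per_column)

# Row numbers where the join key itself is missing.
rows_missing_key <- which(is.na(df1$name))
print(rows_missing_key)
```

Only NAs in the column used for `by` should matter for the join error, so the second check is probably the more useful one.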
Step 2: replace NAs with something else.
df1[is.na(df1)] <- "empty_string"
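Assigning a string to every NA in the data frame also turns numeric columns that contain NAs into character columns. A hedged alternative, assuming the error comes only from NAs in the join key, is to patch just that column and restore the NAs afterwards. The names `df1`, `name`, `score` and `placeholder` below are illustrative assumptions.

```r
# Hypothetical data; "name" plays the role of the join key.
df1 <- data.frame(
  name  = c("Harvard University", NA, "Stanford University"),
  score = c(3.9, NA, 3.7),
  stringsAsFactors = FALSE
)

placeholder <- "empty_string"

# Replace NAs in the key column only, so other columns keep their types.
df1$name[is.na(df1$name)] <- placeholder

# ... the fuzzy join on "name" would run here ...

# Turn the placeholder back into NA once the join is done.
df1$name[df1$name == placeholder] <- NA
```

If both tables receive the same placeholder, the placeholder rows are at distance 0 from each other and will be joined together, so it may be safer to use two clearly different placeholders or to patch only one side.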
Step 3: run the join (the code I was working with when I got the error)
`test1 <- msa_table %>%
  as_tibble() %>%
  # strip any parenthesised text from the key before matching
  mutate(msa = sub("\\(.*\\)", "", as.character(msa))) %>%
  stringdist_full_join(df1, by = "msa", max_dist = 2)`
The result for me was that the error went away, but I still had NAs in my tables.
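For the "another way" part of the question, here is a rough base R sketch that avoids placeholders by never passing NA keys to the distance function. `approx_left_join` is a hypothetical helper written for this post, not part of fuzzyjoin, and it keeps only the single closest match per row, whereas `stringdist_left_join` returns every match within `max_dist`. The sample rows are invented; only the column names come from the question.

```r
# Hypothetical helper: approximate left join that leaves NA keys unmatched.
approx_left_join <- function(x, y, x_key, y_key, max_dist = 2) {
  x_names <- as.character(x[[x_key]])
  y_names <- as.character(y[[y_key]])
  y_ok <- which(!is.na(y_names))             # rows of y that can be matched
  best <- rep(NA_integer_, length(x_names))  # matched row of y for each row of x

  if (length(y_ok) > 0) {
    for (i in which(!is.na(x_names))) {
      # Generalized Levenshtein distance from one x name to every usable y name.
      d <- drop(utils::adist(x_names[i], y_names[y_ok]))
      if (min(d) <= max_dist) best[i] <- y_ok[which.min(d)]
    }
  }

  # Indexing with NA gives a row of NAs, so unmatched rows of x are kept.
  matched <- y[best, , drop = FALSE]
  rownames(matched) <- NULL
  cbind(x, matched)
}

# Invented sample data using the column names from the question.
new <- data.frame(
  tm_9_undergradu = c("Harvard Univ", NA, "Stanford"),
  stringsAsFactors = FALSE
)
institutions <- data.frame(
  Institution.Name = c("Harvard Univ.", "Stanford", NA),
  state = c("MA", "CA", NA),
  stringsAsFactors = FALSE
)

joined <- approx_left_join(new, institutions, "tm_9_undergradu", "Institution.Name")
print(joined)
```

This loops row by row, so it is only meant for small tables; for anything larger, filtering the NA rows out, running the fuzzyjoin call on the rest, and binding the NA rows back on is likely the better route.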
Hope this helps! Also, to be clear: this solution came from Anton Prokopyev '@prokopyev' on GitHub.