
Find matching strings between two vectors in R

I have two vectors in R. I want to find partial matches between them.

My Data

The first one is from a dataset named muc, which contains 6400 street names. muc$name looks like:

muc$name = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße",...)

The other vector is d_vector. It contains around 1400 names.

d_vector = c("Abel", "Abendroth", "von Abercron", "Abetz", "Abicht", "Abromeit", ...)

I want to find all the street names that contain a name from d_vector anywhere in the street name.

First, I made some general adjustments after importing the CSV data (as variable d):

d_vector <- unlist(d$name)
d_vector <- as.vector(as.matrix(d_vector))

What I tried so far

  • Then I tried grep, collapsing d_vector into one long string separated by | for a regex search:

result <- unique(grep(paste(d_vector, collapse="|"), muc$Name, value=TRUE, ignore.case = TRUE))
result

But the result returns all the street names.

  • I also tried agrep, which returned an out-of-memory error.

  • When I tried d_vector %in% muc$name, it returned just one TRUE and hundreds of FALSE, which doesn't seem right.

Do you have any suggestions as to where my mistake could lie, or which library I could use? I am looking for something like Python's "fuzzywuzzy" for R.

Asked by Benedict Witzenberger, Jul 14 '16 at 10:07



2 Answers

In principle, your solution works fine with some dummy data:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", 
            "Konrad-Adenauer-Platz", "anotherThing")
patterns = c("weg", "platz")

unique(grep(paste(patterns, collapse="|"), streets, value=TRUE, ignore.case = TRUE))
# [1] "Berberichweg"          "Otto-Klemperer-Weg"    "Konrad-Adenauer-Platz"

I think something is not quite right with d_vector. Check class(d_vector), or run dput(d_vector) and paste the output here.
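As a guess rather than a diagnosis, here is a minimal sketch of the kind of check meant above. It assumes the imported names might contain empty strings or NA values; the small d_vector and streets below are hypothetical stand-ins, not the real data. paste() turns NA into the text "NA", and an empty string leaves an empty alternative ("||") in the collapsed pattern, which may match far more than intended.

```r
# Hypothetical stand-in for the real d_vector, with two deliberately bad entries.
d_vector <- c("Abel", "", NA, "Abetz")
streets  <- c("Abelweg", "Feldmeierbogen", "Nanga-Parbat-Weg")

# Inspect the vector: type, number of NAs, number of empty strings.
class(d_vector)
sum(is.na(d_vector))
sum(!is.na(d_vector) & !nzchar(d_vector))

# The pattern built from the raw vector contains "||" and the literal text "NA".
raw_pattern <- paste(d_vector, collapse = "|")

# Drop NAs, trim whitespace, drop empty strings, then rebuild the pattern.
clean <- trimws(d_vector[!is.na(d_vector)])
clean <- unique(clean[nzchar(clean)])
pattern <- paste(clean, collapse = "|")

result <- grep(pattern, streets, value = TRUE, ignore.case = TRUE)
result
```

If the counts of NAs and empty strings are both zero for the real d_vector, the cause lies elsewhere and the dput() output would still be the most useful thing to post.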

You can also try using sapply and see if that will work:

matches = sapply(patterns, function(p) grep(p, streets, value=TRUE, ignore.case = TRUE))
# $weg
# [1] "Berberichweg"       "Otto-Klemperer-Weg"
# 
# $platz
# [1] "Konrad-Adenauer-Platz"

unique(unlist(matches))
# [1] "Berberichweg"          "Otto-Klemperer-Weg"    "Konrad-Adenauer-Platz"
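A variation on the per-pattern loop above, sketched under the assumption that the names should be matched as literal text rather than as regular expressions (useful if a name ever contains characters such as "." or "("). With fixed = TRUE the ignore.case argument is not available, so both sides are lowercased instead. The variable names are reused from the dummy data above.

```r
streets  <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen",
              "Konrad-Adenauer-Platz", "anotherThing")
patterns <- c("weg", "platz")

# Lowercase once for comparison, but keep the original spelling for output.
streets_lc <- tolower(streets)

# One list element per pattern: the streets containing it as a literal substring.
matches <- lapply(tolower(patterns), function(p) {
  streets[grepl(p, streets_lc, fixed = TRUE)]
})
names(matches) <- patterns

# All streets that matched at least one pattern, without duplicates.
matched <- unique(unlist(matches, use.names = FALSE))
matched
```

Looping per pattern also keeps each search small, instead of handing grep one very long alternation.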
Answered by Deena, Oct 03 '22 at 11:10


Simple solution:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)

sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))

#                   berber   weg
#berberichweg        TRUE  TRUE
#otto-klemperer-weg  FALSE TRUE
#feldmeierbogen      FALSE FALSE
#altostraße          FALSE FALSE
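The TRUE/FALSE matrix gets hard to read with 6400 streets and 1400 names. A possible follow-up, sketched on a subset of the toy data above, is to turn it into a table of (street, name) pairs. The names surnames, m, idx and pairs are only illustrative, and literal matching via fixed = TRUE is an assumption about what is wanted.

```r
streets  <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen")
surnames <- c("Berber", "Weg")

# Logical matrix: one row per street, one column per name.
# grepl is vectorised over streets, so only one sapply is needed.
m <- sapply(tolower(surnames), function(y) {
  grepl(y, tolower(streets), fixed = TRUE)
})

# Row/column positions of every TRUE cell.
idx <- which(m, arr.ind = TRUE)

# One row per (street, name) hit, using the original spelling.
pairs <- data.frame(
  street = streets[idx[, "row"]],
  name   = surnames[idx[, "col"]],
  stringsAsFactors = FALSE
)
pairs

# Streets with at least one hit.
unique(pairs$street)
```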
Answered by catastrophic-failure, Oct 03 '22 at 11:10