
R match key values vector with irregular strings vector

I'm stuck on a problem that I have searched the forum for without success, so I'm trying my luck by asking directly.

I have a vector of irregular strings that mention various cities, and I would like to label each of these strings using a key-values vector containing the city names. For example,

Vector <- c("...the life in Paris is ...","In Roma, there is...","...nice weekend in New York with...")
Cities <- c("London","Paris","Madrid","Roma","New York")

Each string in Vector should be labelled with the matching value from Cities.

I first thought of using loops, but with the size of my data the search takes too long in R. I then tried a matrix-style computation with grep, but I always get errors.

Do you have an idea if this is the right way to go?

jernac asked Apr 16 '26 11:04


2 Answers

You can use sapply and grepl:

# one column per city, one row per string; TRUE where the city name occurs
check_vec <- sapply(Cities, grepl, Vector)
rownames(check_vec) <- Vector

check_vec
#                                    London Paris Madrid  Roma New York
#...the life in Paris is ...          FALSE  TRUE  FALSE FALSE    FALSE
#In Roma, there is...                 FALSE FALSE  FALSE  TRUE    FALSE
#...nice weekend in New York with...  FALSE FALSE  FALSE FALSE     TRUE

If you need the matching city for each string:

apply(check_vec, 1, function (x) colnames(check_vec)[which(x)])
#        ...the life in Paris is ...                In Roma, there is... ...nice weekend in New York with... 
#                            "Paris"                              "Roma"                          "New York" 
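
Note that the apply call only gives a plain character vector when every string matches exactly one city; with no match or several matches it returns a list. Here is a minimal sketch of one way to handle those cases, assuming you want NA for no match and a comma-separated label for several. The helper name label_cities and the test strings in Vector2 are made up for illustration.

```r
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")
Vector2 <- c("...the life in Paris is ...",
             "from London to Madrid by train",
             "no city mentioned here")

# hypothetical helper: returns exactly one label per string
label_cities <- function(strings, cities) {
  # fixed = TRUE treats the city names as literal text, not regular expressions
  hits <- vapply(cities, grepl, logical(length(strings)),
                 x = strings, fixed = TRUE)
  # force a strings-by-cities matrix, even for zero or one input string
  hits <- matrix(hits, nrow = length(strings), ncol = length(cities),
                 dimnames = list(NULL, cities))
  vapply(seq_len(nrow(hits)), function(i) {
    found <- cities[hits[i, ]]
    if (length(found) == 0) NA_character_ else paste(found, collapse = ", ")
  }, character(1))
}

labels <- label_cities(Vector2, Cities)
labels
```

Several matches are listed in the order of Cities, not in the order they appear in the string.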

edit

For a safer way, as wisely advised by @nicola, you can use vapply instead of sapply:

check_vec <- vapply(Cities, grepl, x=Vector, logical(length(Vector)))
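
To illustrate why vapply is the safer choice, here is a small sketch with an edge case that is not in the question: an empty input vector (the name empty is just for the example). As far as I know, sapply falls back to a list here, while vapply keeps the declared matrix shape.

```r
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")
empty <- character(0)  # e.g. an input that was filtered down to nothing

res_sapply <- sapply(Cities, grepl, empty)
res_vapply <- vapply(Cities, grepl, x = empty, logical(length(empty)))

class(res_sapply)  # sapply cannot simplify zero-length results, so this is a list
dim(res_vapply)    # vapply still returns a logical matrix: 0 rows, one column per city
```

Code that later indexes check_vec as a matrix therefore keeps working with vapply, and a result of the wrong type or length raises an error instead of silently changing shape.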
Cath answered Apr 19 '26 06:04


Here's a method using a text analysis package, quanteda. It lets you set up a set of pattern matches for city names, which is useful, for instance, if you have different spellings of a city (e.g. "Rome" and "Roma") but want to count them as a single city. The matches below use the simplified "glob" format, but you can also use regular expression matching.

require(quanteda)

# only required if you have compound word city names
compoundCities <- dictionary(list(NY = "New York"))
VectorPhrased <- phrasetotoken(Vector, compoundCities)

# uses the "glob" format for Pattern Matching
citiesDict <- dictionary(list(London = c("London", "Londres"), Paris = "Paris", 
                              Rome = "Rom?", NewYork = "New_York"))

dfm(VectorPhrased, dictionary = citiesDict, verbose = FALSE)
# Document-feature matrix of: 3 documents, 4 features.
# 3 x 4 sparse Matrix of class "dfmSparse"
#        features
# docs    London Paris Rome NewYork
#   text1      0     1    0       0
#   text2      0     0    1       0
#   text3      0     0    0       1
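
If you would rather not add a package, the same dictionary idea can be roughly approximated in base R. This is only a sketch and not what quanteda does internally: citiesRegex is a hypothetical named vector with one regular expression per city, and the result records presence (0/1) per string instead of counting occurrences.

```r
Vector <- c("...the life in Paris is ...", "In Roma, there is...",
            "...nice weekend in New York with...")

# hypothetical dictionary: one regular expression per city, covering spelling variants
citiesRegex <- c(London  = "London|Londres",
                 Paris   = "Paris",
                 Rome    = "Rom[ae]",
                 NewYork = "New York")

# one row per string, one column per dictionary key; 1 if any variant occurs
presence <- vapply(citiesRegex,
                   function(p) as.integer(grepl(p, Vector)),
                   integer(length(Vector)))
presence
```

Unlike the tokenised approach, these patterns match anywhere in the string, so "Rom[ae]" would also hit a word such as "Romantic" followed by the right letter; adding \\b word boundaries to the patterns is one way to tighten that.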
Ken Benoit answered Apr 19 '26 06:04