I'm stuck on a problem that I have searched the forum for without success, so I am asking directly.
I have a vector of irregular strings, each mentioning some city, and I would like to label each string with the matching name from a key vector of city names. For example:
Vector <- c("...the life in Paris is ...","In Roma, there is...","...nice weekend in New York with...")
Cities <- c("London","Paris","Madrid","Roma","New York")
For each string in Vector, the result should be the corresponding value from Cities.
I first thought of using loops, but with my data size the search takes too long in R. I then tried a more vectorized, matrix-style approach with grep, but I keep getting errors.
Is this the right way to go?
You can use sapply and grepl:
check_vec <- sapply(Cities, grepl, Vector)
row.names(check_vec) <- Vector
check_vec
#                                     London Paris Madrid  Roma New York
# ...the life in Paris is ...          FALSE  TRUE  FALSE FALSE    FALSE
# In Roma, there is...                 FALSE FALSE  FALSE  TRUE    FALSE
# ...nice weekend in New York with...  FALSE FALSE  FALSE FALSE     TRUE
If you need the matching keyword for each string:
apply(check_vec, 1, function (x) colnames(check_vec)[which(x)])
# ...the life in Paris is ... In Roma, there is... ...nice weekend in New York with...
#                     "Paris"               "Roma"                          "New York"
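The apply call above returns a tidy character vector only when every string matches exactly one city. Here is a minimal sketch of one way to handle strings with no match or several matches. The two extra test strings, the name labels, and the use of fixed = TRUE (so that city names are treated as literal text rather than regular expressions) are my additions, not part of the original answer:

```r
Vector <- c("...the life in Paris is ...", "In Roma, there is...",
            "...nice weekend in New York with...",
            "no city here", "from Paris to London")
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")

# One logical column per city, one row per string.
check_vec <- vapply(Cities, grepl, logical(length(Vector)),
                    x = Vector, fixed = TRUE)

# Collapse every matching city into one label per string;
# strings with no match produce "" and are turned into NA.
labels <- apply(check_vec, 1, function(x) paste(Cities[x], collapse = ", "))
labels[labels == ""] <- NA_character_
labels
```

Cities are listed in the order they appear in Cities, not in the order they occur in the string. Note that this sketch assumes Vector has more than one element, since a single string would make vapply return a plain vector instead of a matrix.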
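Edit:
For a safer way, as wisely advised by @nicola, you can use vapply instead of sapply. vapply checks that every result has the declared type and length, so the output is always a logical matrix: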
check_vec <- vapply(Cities, grepl, x=Vector, logical(length(Vector)))
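If only the first city found in each string is needed, a possible alternative is to join the city names into a single alternation pattern and search once per string, which avoids building the full logical matrix. This is a rough sketch, not part of the answer above. The names pattern, hit and first_city are hypothetical, and it assumes the city names contain no regular-expression metacharacters:

```r
Vector <- c("...the life in Paris is ...", "In Roma, there is...",
            "...nice weekend in New York with...", "no city here")
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")

# Single regular expression: "London|Paris|Madrid|Roma|New York"
pattern <- paste(Cities, collapse = "|")

# Position of the first match in each string (-1 when nothing matches).
hit <- regexpr(pattern, Vector)

# regmatches() returns substrings only for the strings that matched,
# so fill them into a vector pre-populated with NA.
first_city <- rep(NA_character_, length(Vector))
first_city[hit > 0] <- regmatches(Vector, hit)
first_city
```

A plain alternation also matches inside longer words (for example "Roma" inside "Romania"), so wrapping the pattern in word boundaries such as \\b may be worth considering for real data.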
Here's a method using a text analysis package, quanteda. It allows you to set up a set of pattern matches for city names, which is useful for instance if you have different spellings of cities (e.g. "Rome" and "Roma") but want to count them as a single city. Below the matches use the simplified "glob" format, but you can also use regular expression matching.
require(quanteda)
# only required if you have compound word city names
compoundCities <- dictionary(list(NY = "New York"))
VectorPhrased <- phrasetotoken(Vector, compoundCities)
# uses the "glob" format for Pattern Matching
citiesDict <- dictionary(list(London = c("London", "Londres"), Paris = "Paris",
                              Rome = "Rom?", NewYork = "New_York"))
dfm(VectorPhrased, dictionary = citiesDict, verbose = FALSE)
# Document-feature matrix of: 3 documents, 4 features.
# 3 x 4 sparse Matrix of class "dfmSparse"
#        features
# docs    London Paris Rome NewYork
#   text1      0     1    0       0
#   text2      0     0    1       0
#   text3      0     0    0       1