I have a dataset with more than 20.000 observations that basically looks like this one:
df <- data.frame(
user = c("ABC", "DEF", "GHI"),
location = c("Chicago, the windy city", "Oxford University", "Paris")
)
I want to add three additional columns city, long, lat and fill these columns with the city name, and the geolocations (longitude and latitude).
Therefore I thought to use the maps package and its world.cities database:
library(maps)
data(world.cities)
Adding the city names and geolocation would not be difficult, if the city names in location would be displayed in the right way. However, most of them do have additional strings (e.g. "Chicago, the windy city").
How can I extract just the city names based on the world.cities database and write the real city name to the column city and the geolocations to long and lat?
As per mentioned by @Heroka in the comments, if the name of the city is always the first string in location, you could extract the first string using stringi, left_join the world.cities data, and filter for the largest population in the matches.
library(stringi)
library(dplyr)
df %>%
mutate(city = stri_extract_first_words(location)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Which gives:
#Source: local data frame [3 x 8]
#Groups: city [3]
#
# user location city country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (int) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
Update
If the name of the city is not always the first string in location, you could first try to match the words in location with a dictionary (here the name column in world.cities) and then use the matches that return TRUE as your location name. Here's a quick implementation (I added the "University College London" case to you data.frame)
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL University College London
For each row, we extract all the words in location and store them in a list lst, loop over it to find the position of the matching name in world.cities and store it in p, and finally extract the element corresponding to position p in lst and store it in city
df %>%
mutate(lst = stri_extract_all_words(location),
p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
You can also remove the temporary columns p and lst by adding ... %>% select(-lst, -p)
Update 2
This should not break on malformed words but won't work for the "New York" case:
df %>%
mutate(
city = lapply(stri_extract_all_words(location),
function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
tidyr::unnest(city) %>%
filter(!is.na(city)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Update 3
This should work in all the cases:
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL New York
#5 MNO m0ntr3al
#6 PQR University College London
df$l <- gsub("[^[:alnum:]]+", " ", df$location)
lst <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
m <- data.table::melt(lst)
df %>%
left_join(m, by = c("l" = "value")) %>%
left_join(world.cities %>%
add_rownames %>%
mutate(rowname = as.numeric(rowname)),
by = c("L1" = "rowname")) %>%
tidyr::replace_na(list(pop = 0)) %>%
group_by(location) %>%
filter(row_number(desc(pop)) == 1) %>%
select(-(l:L1))
Which gives:
#Source: local data frame [6 x 8]
#Groups: location [6]
#
# user location name country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (dbl) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
#4 JKL New York New York USA 8124427 40.67 -73.94 0
#5 MNO m0ntr3al NA NA 0 NA NA NA
#6 PQR Univeristy College London London UK 7489022 51.52 -0.10 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With