
Extract city names from large text with R

Tags: r, extract

Hello, I have an intriguing question. Suppose I have a long character string that contains city names among other text.

test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"

My goal is to extract all the city names from it. I attempted this with the following five steps.

   library(stringr)  # str_replace_all()
   library(maps)     # world.cities

   # Replace | with ,
   test2 <- str_replace_all(test, "[|]", ", ")

   # Remove punctuation and newlines
   test3 <- gsub("[[:punct:]\n]", "", test2)

   # Split on single spaces
   test4 <- strsplit(test3, " ")

   # Load the world.cities data set from the maps package
   data(world.cities)

   # Keep the tokens that match a name in world.cities
   citiestest <- lapply(test4, function(x) x[which(x %in% world.cities$name)])

The result is only partly correct:

citiestest
[[1]]
 [1] "San"        "Boston"     "Boston"     "Washington" "York"      
 [6] "York"       "Kettering"  "York"       "York"       "Charlotte" 
[11] "Carolina"   "Cleveland"  "Nashville"  "Seattle"    "Seattle"   
[16] "Washington" "Asan"      

But as you can see, I cannot handle cities with two-word names (New York, San Diego, etc.) because they are split apart. Fixing this manually is not an option, as my real dataset is quite large.
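A minimal sketch of one way to keep multi-word names intact, assuming every city sits in its own comma- or pipe-delimited field: split on those delimiters instead of on spaces, then match whole fields. Here `known_cities` is a hypothetical stand-in for `world.cities$name`, and `sample_text` is a shortened copy of `test`.

```r
# Shortened copy of the input string (two of the original entries)
sample_text <- "Ucsd Medical Center, San Diego, California, USA|Mount SInai Medical Center, New York, New York, USA"

# Hypothetical stand-in for world.cities$name from the maps package
known_cities <- c("San Diego", "New York", "Boston")

# Split on pipes and commas, then strip surrounding whitespace,
# so "San Diego" stays one field instead of becoming "San" and "Diego"
fields <- trimws(strsplit(sample_text, "[|,]")[[1]])

# Keep only the fields that are exact matches for a known city name
matched <- fields[fields %in% known_cities]
print(matched)
```

Because fields are compared by name only, the state "New York" matches as well, so a name lookup alone cannot tell a city from a state with the same name.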

firmo23 asked Jan 26 '18 02:01


2 Answers

A rather different approach which may be more or less useful, depending on the data at hand: Pass each address to a geocoding API, then pull the city out of the response.

library(tidyverse)

places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>% 
    separate_rows(string, sep = '\\|')

places <- places %>% 
    mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))

places <- places %>% 
    mutate(address_components = map(geodata, list('results', 1, 'address_components')),
           address_components = map(address_components, 
                                    ~as_data_frame(transpose(.x)) %>% 
                                        unnest(long_name, short_name)),
           city = map(address_components, unnest),
           city = map_chr(city, ~{
               l <- set_names(.x$long_name, .x$types); 
               coalesce(l['locality'], l['administrative_area_level_1'])
           }))

Comparing the result and the original,

places %>% select(city, string)
#> # A tibble: 17 x 2
#>    city       string                                                                               
#>    <chr>      <chr>                                                                                
#>  1 San Diego  Ucsd Medical Center, San Diego, California, USA                                      
#>  2 New Haven  Yale Cancer Center, New Haven, Connecticut, USA                                      
#>  3 Boston     Massachusetts General Hospital., Boston, Massachusetts, USA                          
#>  4 Boston     Dana Farber Cancer Institute, Boston, Massachusetts, USA                             
#>  5 St. Louis  Washington University, Saint Louis, Missouri, USA                                    
#>  6 New York   Mount SInai Medical Center, New York, New York, USA                                  
#>  7 New York   Memorial Sloan Kettering Cancer Center, New York, New York, USA                      
#>  8 Charlotte  Carolinas Healthcare System, Charlotte, North Carolina, USA                          
#>  9 Cleveland  University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville  Vanderbilt University Medical Center, Nashville, Tennessee, USA                      
#> 11 Seattle    Seattle Cancer Care Alliance, Seattle, Washington, USA                               
#> 12 Goyang-si  National Cancer Center, Gyeonggi-do, Korea, Republic of                              
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of                        
#> 14 Seoul      Severance Hospital, Yonsei University Health System, Seoul, Korea,  Republic of       
#> 15 Seoul      Korea University Guro Hospital, Seoul, Korea, Republic of                            
#> 16 Seoul      Asan Medical Center., Seoul, Korea, Republic of                                      
#> 17 Amsterdam  VU MEDISCH CENTRUM; Dept. of Medical Oncology   

...well, it's not perfect. The biggest issue is that cities are classified as localities in the US, but as administrative_area_level_1 (which in the US is the state) in South Korea. Unlike the other Korean rows, row 12 actually has a locality, which is not the city listed (the listed one appears in the response as an administrative region). Further, "Seoul" in row 13 was inexplicably returned in Korean.

The good news is that "Saint Louis" has been shortened to "St. Louis", which is a more standardized form, and the last row has been located in Amsterdam.

Scaling such an approach would likely require paying Google a little for the usage of their API.
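The locality-then-region fallback described above can be sketched in base R. This is only an illustration: `pick_city` and the two small data frames are hypothetical, and they assume the address components have already been flattened to one row per component with a single type each, which real geocoder responses are not.

```r
# Return the long_name of the first component whose type appears in `prefer`,
# trying the preferred types in order; NA if none is present.
pick_city <- function(components,
                      prefer = c("locality", "administrative_area_level_1")) {
  for (type in prefer) {
    hit <- components$long_name[components$type == type]
    if (length(hit) > 0) return(hit[1])
  }
  NA_character_
}

# Hypothetical flattened components for a US address (has a locality)
us <- data.frame(
  long_name = c("San Diego", "California", "United States"),
  type      = c("locality", "administrative_area_level_1", "country"),
  stringsAsFactors = FALSE
)

# Hypothetical flattened components for a Korean address (no locality)
kr <- data.frame(
  long_name = c("Seoul", "South Korea"),
  type      = c("administrative_area_level_1", "country"),
  stringsAsFactors = FALSE
)

print(pick_city(us))
print(pick_city(kr))
```

The order of `prefer` is the design choice that matters: a row like 12, which has both a locality and a region, resolves to the locality even when the region is the name written in the input.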

alistaire answered Nov 08 '22 23:11


Here is a base R option using strsplit and gsub:

terms <- unlist(strsplit(test, "\\s*\\|\\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\\s*([^,]+),.*", "\\1", x))
cities[1:3]

            Ucsd Medical Center, San Diego, California, USA 
                                                "San Diego" 
            Yale Cancer Center, New Haven, Connecticut, USA 
                                                "New Haven" 
Massachusetts General Hospital., Boston, Massachusetts, USA
                                                   "Boston"

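The pattern above captures the second comma-separated field, so an entry whose institution name itself contains a comma (such as "Severance Hospital, Yonsei University Health System, Seoul, ...") would yield the second field rather than the city, and an entry with no comma is returned unchanged. A possible variation, sketched under the assumption that the city is always the third field from the end and that every usable entry has at least four fields, counts from the right instead. `city_from_end` is a hypothetical helper and `sample_text` is a shortened copy of `test`.

```r
# Three of the original entries: a regular one, one with an extra comma
# in the institution name, and one with no location at all
sample_text <- "Ucsd Medical Center, San Diego, California, USA|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"

terms <- strsplit(sample_text, "\\s*\\|\\s*")[[1]]

# Assumption: the layout is "<institution...>, <city>, <field>, <field>",
# so the city is the third field counted from the end.
city_from_end <- function(x) {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  n <- length(parts)
  if (n < 4) NA_character_ else parts[n - 2]
}

cities <- vapply(terms, city_from_end, character(1), USE.NAMES = FALSE)
print(cities)
```

Entries that do not fit the assumed layout come back as NA rather than as a wrong guess, which makes them easy to find and review in a large dataset.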

Tim Biegeleisen answered Nov 09 '22 01:11