extract comma separated strings

I have a data frame as below. This is a sample of the data with uniform-looking patterns, but the whole data set is not very uniform:

locationid      address     
1073744023  525 East 68th Street, New York, NY      10065, USA
1073744022  270 Park Avenue, New York, NY 10017, USA      
1073744025  Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA 
1073744024  1251 Avenue of the Americas, New York, NY 10020, USA
1073744021  1301 Avenue of the Americas, New York, NY 10019, USA 
1073744026  44 West 45th Street, New York, NY 10036, USA

I need to extract the city and country name from each address. I tried the following:

1) strsplit: this gives me a list, but I cannot access the last or third-to-last element of each item.

2) Regular expressions: finding the country is easy

str_sub(str_extract(address, "\\d{5},\\s.*"),8,11)

but for city

str_sub(str_extract(address, ",\\s.+,\\s.+\\d{5}"),3,comma_pos)

I cannot find comma_pos, as that leads me back to the same problem. I believe there is a more efficient way to solve this using either of the above approaches.

Cagg asked Dec 14 '22 18:12


2 Answers

Try this code:

library(gsubfn)

cn <- c("Id", "Address", "City", "State", "Zip", "Country")

pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"
read.pattern(text = Lines, pattern = pat, col.names = cn, as.is = TRUE)

giving the following data.frame, from which it's easy to pick off components:

          Id                                  Address     City State   Zip Country
1 1073744023                     525 East 68th Street New York    NY 10065     USA
2 1073744022                          270 Park Avenue New York    NY 10017     USA
3 1073744025 Rockefeller Center, 50 Rockefeller Plaza New York    NY 10020     USA
4 1073744024              1251 Avenue of the Americas New York    NY 10020     USA
5 1073744021              1301 Avenue of the Americas New York    NY 10019     USA
6 1073744026                      44 West 45th Street New York    NY 10036     USA

Explanation: It uses this pattern (within a quoted R string, the backslashes must be doubled):

(\d+) (.+), (.+), (..) (\d+), (.+)

[Figure: Debuggex railroad diagram of the regular expression]

It can be explained in words as follows:

  • "(\\d+)" - one or more digits (representing the Id) followed by
  • " " - a space followed by
  • "(.+)" - any non-empty string (representing the Address) followed by
  • ", " - a comma and a space followed by
  • "(.+)" - any non-empty string (representing the City) followed by
  • ", " - a comma and a space followed by
  • "(..)" - two characters (representing the State) followed by
  • " " - a space followed by
  • "(\\d+)" - one or more digits (representing the Zip) followed by
  • ", " - a comma and a space followed by
  • "(.+)" - any non-empty string (representing the Country)

It works because regular expression quantifiers are greedy: each one tries to match the longest string it can, backtracking whenever a subsequent portion of the regular expression fails to match.
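The greedy behaviour can be checked without gsubfn. The following is a minimal base-R sketch, not part of the original answer: it applies the same pattern to the one sample row that contains an extra comma, using regexec() with perl = TRUE so that backtracking semantics apply. The variable names line, m and fields are chosen here for illustration only.

```r
# One sample row from the question (the one with an extra comma in the address).
line <- "1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"
pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"

# regexec() returns the full match followed by one entry per capture group.
m <- regmatches(line, regexec(pat, line, perl = TRUE))[[1]]
fields <- setNames(m[-1], c("Id", "Address", "City", "State", "Zip", "Country"))

# The first (.+) greedily takes as much as it can while still letting the
# rest of the pattern match, so the extra comma ends up inside Address.
print(fields[["Address"]])
print(fields[["City"]])
```

Because the tail of the pattern pins down the state, zip and country, the only freedom left is where the address ends and the city begins, and greediness resolves that in favour of the longer address.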

The advantage of this approach is that the regular expression is quite simple and straightforward, and the entire code is quite concise, since one read.pattern statement does it all.

Note: We used this for Lines:

Lines <- "1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA"
G. Grothendieck answered Jan 02 '23 21:01


Split the data

ss <- strsplit(data,",")

Then

n <- sapply(ss,length)

will give the number of elements (so you can work backward). Then

mapply("[[",ss,n)

gives you the last element. Or you could do

sapply(ss,tail,1)

to get the last element.

To get the second-to-last element (or, more generally, the k-th element from the end, using tail(x,k)[1]) you need

sapply(ss,function(x) tail(x,2)[1])
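Putting these pieces together for the question's goal (city and country), here is a minimal sketch under a few assumptions: addr holds two of the sample addresses, the split is done on a comma plus any following whitespace so the pieces carry no leading space, and from_end is a hypothetical helper introduced only for this example. With the layout shown in the question, the city is the third piece from the end.

```r
# Two sample addresses from the question; `addr` is a name chosen for this sketch.
addr <- c("525 East 68th Street, New York, NY 10065, USA",
          "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA")

# Split on a comma plus any following whitespace.
parts <- strsplit(addr, ",\\s*")

# Hypothetical helper: the k-th element counting from the end of a vector.
from_end <- function(x, k) tail(x, k)[1]

country <- sapply(parts, from_end, k = 1)  # last piece
city    <- sapply(parts, from_end, k = 3)  # third piece from the end
print(data.frame(city, country))
```

Counting from the end is what makes this robust to extra commas at the start of the address. If the real data has trailing whitespace, applying trimws() to the addresses first would likely be needed.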
Ben Bolker answered Jan 02 '23 22:01