R regex / gsub : extract part of pattern

Tags: regex, r

I have a list of weather stations and their locations by latitude and longitude. There was a formatting issue, so some of them have hours and minutes while others have hours, minutes and seconds. I can find the pattern using a regex, but I'm having trouble extracting the individual pieces.

Here's the data:

> head(wthrStat1 )
     Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W

I'd like something like this:

     Station latHr latMin latSec latDir lonHr lonMin lonSec lonDir
1940    K01R    31     08     00      N   092     34     00      W
1941    K01T    28     08     00      N   094     24     00      W
1942    K03Y    48     47     00      N   096     57     00      W
1943    K04V    38     05     50      N   106     10     07      W
1944    K05F    31     25     16      N   097     47     49      W
1945    K06D    48     53     04      N   099     37     15      W

I can get matches to this regex:

data.format <- "\\d{1,3}-\\d{1,3}(?:-\\d{1,3})?[NSWE]{1}"
grep(data.format, wthrStat1$lat)

But I am unsure how to get the individual parts into columns. I've tried a few things like:

wthrStat1$latHr <- ifelse(grepl(data.format, wthrStat1$lat), gsub(????), NA)

but with no luck.

Here's a dput():

> dput(wthrStat1[1:10,] )
structure(list(Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", 
"K06D", "K07G", "K07S", "K08D", "K0B9"), lat = c("31-08N", "28-08N", 
"48-47N", "38-05-50N", "31-25-16N", "48-53-04N", "42-34-28N", 
"47-58-27N", "48-18-03N", "43-20N"), lon = c("092-34W", "094-24W", 
"096-57W", "106-10-07W", "097-47-49W", "099-37-15W", "084-48-41W", 
"117-25-42W", "102-24-23W", "070-24W")), .Names = c("Station", 
"lat", "lon"), row.names = 1940:1949, class = "data.frame")

Any suggestions?

screechOwl asked Feb 05 '13 05:02


2 Answers

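strapplyc in the gsubfn package extracts each parenthesized capture group of the regular expression. With simplify = rbind the result is a character matrix with one row per input string, and an optional group that did not match comes back as an empty string: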

library(gsubfn)
data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"
parts <- strapplyc(wthrStat1$lat, data.format, simplify = rbind)
parts[parts == ""] <- "00"

which gives:

> parts
      [,1] [,2] [,3] [,4]
 [1,] "31" "08" "00" "N" 
 [2,] "28" "08" "00" "N" 
 [3,] "48" "47" "00" "N" 
 [4,] "38" "05" "50" "N" 
 [5,] "31" "25" "16" "N" 
 [6,] "48" "53" "04" "N" 
 [7,] "42" "34" "28" "N" 
 [8,] "47" "58" "27" "N" 
 [9,] "48" "18" "03" "N" 
[10,] "43" "20" "00" "N" 
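
The matrix still has to be attached to the data frame, and the same steps repeated for lon. Below is a minimal sketch of that step in base R, for cases where gsubfn is not installed. It swaps strapplyc for regexec() and regmatches(), so it is a different technique from the one above. split_coord is a hypothetical helper name, and the pattern is an assumption: it is anchored, and the seconds group only matches together with its leading hyphen.

```r
# First six rows of the question's data
wthrStat1 <- data.frame(
  Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", "K06D"),
  lat = c("31-08N", "28-08N", "48-47N", "38-05-50N", "31-25-16N", "48-53-04N"),
  lon = c("092-34W", "094-24W", "096-57W", "106-10-07W", "097-47-49W", "099-37-15W"),
  stringsAsFactors = FALSE
)

# split_coord() is a hypothetical helper, not part of any package.
# Groups: 1 = hours, 2 = minutes, 3 = optional "-seconds", 4 = seconds, 5 = direction
split_coord <- function(x, prefix) {
  pattern <- "^([0-9]{1,3})-([0-9]{1,3})(-([0-9]{1,3}))?([NSEW])$"
  m <- regmatches(x, regexec(pattern, x))
  # Each element is c(full match, group 1, ..., group 5); keep groups 1, 2, 4, 5.
  # Strings that do not match give character(0), which indexes to a row of NA.
  parts <- do.call(rbind, lapply(m, function(g) g[c(2, 3, 5, 6)]))
  sec <- parts[, 3]
  parts[!is.na(sec) & sec == "", 3] <- "00"   # missing seconds become "00"
  colnames(parts) <- paste0(prefix, c("Hr", "Min", "Sec", "Dir"))
  as.data.frame(parts, stringsAsFactors = FALSE)
}

result <- cbind(wthrStat1["Station"],
                split_coord(wthrStat1$lat, "lat"),
                split_coord(wthrStat1$lon, "lon"))
print(result)
```

All columns stay character, so leading zeros such as "092" are preserved. Convert with as.integer() if numbers are needed.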
G. Grothendieck answered Oct 16 '22 15:10


This is quite inefficient, and I hope someone else has a better solution:

dat <- read.table(text ='   Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W', header = TRUE)


# Groups: 1 = hours, 2 = minutes, 3 = optional "-seconds", 4 = seconds, 5 = direction
pattern <- '([0-9]+)-([0-9]+)(-([0-9]+))?([NSEW])'

dat$latHr  <- gsub(pattern, '\\1', dat$lat)
dat$latMin <- gsub(pattern, '\\2', dat$lat)

# Group 4 is empty when the seconds are missing, so default those to '00'
latSec <- gsub(pattern, '\\4', dat$lat)
latSec[nchar(latSec) == 0] <- '00'
dat$latSec <- latSec

# The direction letter is always captured by group 5
dat$latDir <- gsub(pattern, '\\5', dat$lat)

dat
     Station       lat        lon latHr latMin latSec latDir
1940    K01R    31-08N    092-34W    31     08     00      N
1941    K01T    28-08N    094-24W    28     08     00      N
1942    K03Y    48-47N    096-57W    48     47     00      N
1943    K04V 38-05-50N 106-10-07W    38     05     50      N
1944    K05F 31-25-16N 097-47-49W    31     25     16      N
1945    K06D 48-53-04N 099-37-15W    48     53     04      N
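
The ifelse(grepl(...), gsub(????), NA) shape from the question can also be filled in directly, one capture group per column. The guard matters because sub() and gsub() return the input unchanged when the pattern does not match, so a malformed value would otherwise be copied into every new column. The following is only a sketch: extract_part is a hypothetical helper, "bad-value" is made-up test input, and the anchored pattern is an assumption about the intended format.

```r
lat <- c("31-08N", "38-05-50N", "bad-value")

# Same idea as the question's regex, anchored, with one capture group per part.
# Groups: 1 = hours, 2 = minutes, 3 = optional "-seconds", 4 = seconds, 5 = direction
data.format <- "^(\\d{1,3})-(\\d{1,3})(-(\\d{1,3}))?([NSEW])$"

# extract_part() is a hypothetical helper: it returns capture group `group`,
# or NA where the string does not match the format at all.
extract_part <- function(x, group) {
  ifelse(grepl(data.format, x),
         sub(data.format, paste0("\\", group), x),
         NA_character_)
}

latHr  <- extract_part(lat, 1)
latMin <- extract_part(lat, 2)
latSec <- extract_part(lat, 4)
latSec[!is.na(latSec) & latSec == ""] <- "00"  # matched, but no seconds given
latDir <- extract_part(lat, 5)

print(data.frame(lat, latHr, latMin, latSec, latDir))
```

The same helper works unchanged on the lon column, since the pattern accepts any of N, S, E and W.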
agstudy answered Oct 16 '22 14:10