
Fast way to parse vector of "continent / country / city" in R

I have a character vector in R in which each string has the form "continent / country / city", e.g.

x <- rep("Africa / Kenya / Nairobi", 1000000)

However, the " / " separator is occasionally mistyped as "/" without the surrounding spaces, and in some cases the city is missing, so that the string is e.g. "Africa / Kenya".

I would like to parse this into three vectors, continent, country and city, using NA where the city is missing.

For country, I currently do something like

country = sapply(x, function(loc) trimws(strsplit(loc,"/", fixed = TRUE)[[1]][2]))

but that is very slow when x is long. What would be an efficient way to parse this in R?

Tom Wenseleers asked Dec 22 '22 15:12



1 Answer

Consider using read.table from base R

read.table(text = x, sep = "/", header = FALSE,
      fill = TRUE, strip.white = TRUE, na.strings = "")
      V1    V2      V3
1 Africa Kenya Nairobi
2 Africa Kenya Nairobi
3 Africa Kenya    <NA>
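
To get the three separate vectors the question asks for, the columns can be named and pulled out of the result. The following is only a sketch: the names continent, country and city are my choice, and the extra arguments quote = "" and comment.char = "" are an assumption that the strings may contain apostrophes or "#" that should be read literally. Passing three col.names also fixes the number of columns at three; otherwise read.table infers it from the first five lines, which would drop the city column if none of those lines has a city.

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

# Same call as above, plus explicit column names (illustrative names).
# quote = "" and comment.char = "" are assumptions: they stop ' " and #
# inside a place name from being treated as quoting or comment characters.
parsed <- read.table(text = x, sep = "/", header = FALSE,
                     fill = TRUE, strip.white = TRUE, na.strings = "",
                     col.names = c("continent", "country", "city"),
                     quote = "", comment.char = "",
                     stringsAsFactors = FALSE)

# Pull the columns out as plain character vectors.
continent <- parsed$continent
country   <- parsed$country
city      <- parsed$city   # NA where the city was missing
```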

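Or using fread from data.table (header = FALSE keeps the first string as data instead of turning it into column names)

library(data.table)
fread(text = paste(x, collapse="\n"), sep="/", header = FALSE, fill = TRUE, na.strings = "")
       V1    V2      V3
1: Africa Kenya Nairobi
2: Africa Kenya Nairobi
3: Africa Kenya    <NA>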

Benchmarks

> x <- rep("Africa / Kenya / Nairobi", 1000000)
> system.time(fread(text = paste(x, collapse="\n"), sep="/", fill = TRUE, na.strings = ""))
   user  system elapsed 
  0.473   0.024   0.496 

> system.time(read.table(text = x, sep = "/", header = FALSE,
+       fill = TRUE, strip.white = TRUE, na.strings = ""))
   user  system elapsed 
  0.519   0.026   0.543 

> system.time({  #Using data.table
+   y <- do.call(cbind, data.table::tstrsplit(x, "/", TRUE))
+   y <- trimws(y, whitespace = " ")
+ })
   user  system elapsed 
  2.035   0.051   2.067 
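
The third timing splits the strings and then trims the pieces, giving a character matrix with one column per field. For readers without data.table, roughly the same idea can be sketched in base R as below. This is an assumption-laden illustration, not something that was timed here: it relies on there being at most three fields per string, and the column names are my own.

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

# Split all strings in one vectorised call instead of once per element.
parts <- strsplit(x, "/", fixed = TRUE)

# Index each piece vector with 1:3 so a missing city becomes NA, then trim.
# vapply returns a 3 x n matrix, so transpose to get one row per string.
y <- t(vapply(parts, function(p) trimws(p[1:3]), character(3)))
colnames(y) <- c("continent", "country", "city")  # illustrative names

city <- y[, "city"]
```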

Data

x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")
akrun answered Jan 18 '23 16:01
