I have a character vector in R with each string composed of "continent / country / city", e.g.
x=rep("Africa / Kenya / Nairobi", 1000000)
However, the " / " separator is occasionally mistyped as "/" without the surrounding spaces, and in some cases the city is missing altogether, e.g. "Africa / Kenya".
I would like to parse this into three vectors continent, country & city, using NA if city is missing.
For country I currently do something like
country = sapply(x, function(loc) trimws(strsplit(loc,"/", fixed = TRUE)[[1]][2]))
but that's very slow if the vector x is long. What would be an efficient way to parse this in R?
Consider using read.table from base R:
read.table(text = x, sep = "/", header = FALSE,
fill = TRUE, strip.white = TRUE, na.strings = "")
      V1    V2      V3
1 Africa Kenya Nairobi
2 Africa Kenya Nairobi
3 Africa Kenya    <NA>
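To get from that data frame to the three separate vectors the question asks for, something along these lines might work. It is only a sketch: the column names and the object name parts are my own choice, and quote = "" and comment.char = "" are an assumption on my part that place names may contain apostrophes or "#", which read.table would otherwise treat specially. Supplying col.names also fixes the column count at three; with fill = TRUE, read.table otherwise works out the number of columns from the first five lines only.

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

# Column names are illustrative; giving three of them fixes the column count.
parts <- read.table(text = x, sep = "/", header = FALSE,
                    col.names = c("continent", "country", "city"),
                    fill = TRUE, strip.white = TRUE, na.strings = "",
                    quote = "", comment.char = "",   # assumption: no quoting or comments in the data
                    stringsAsFactors = FALSE)        # keep plain character columns on R < 4.0

continent <- parts$continent
country   <- parts$country
city      <- parts$city   # NA where the city was missing
```

Each of the three results is an ordinary character vector of the same length as x.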
Or using fread from data.table:
library(data.table)
fread(text = paste(x, collapse="\n"), sep="/", fill = TRUE, na.strings = "")
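   Africa Kenya Nairobi
1: Africa Kenya Nairobi
2: Africa Kenya    <NA>
Note that fread has taken the first string as the header row here, so only two of the three inputs appear as data; passing header = FALSE should avoid this.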
Benchmarks on a vector of one million strings:
> x <- rep("Africa / Kenya / Nairobi", 1000000)
> system.time(fread(text = paste(x, collapse="\n"), sep="/", fill = TRUE, na.strings = ""))
   user  system elapsed
  0.473   0.024   0.496
> system.time(read.table(text = x, sep = "/", header = FALSE,
+     fill = TRUE, strip.white = TRUE, na.strings = ""))
   user  system elapsed
  0.519   0.026   0.543
> system.time({ # Using data.table
+     y <- do.call(cbind, data.table::tstrsplit(x, "/", TRUE))
+     y <- trimws(y, whitespace = " ")
+ })
   user  system elapsed
  2.035   0.051   2.067
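For comparison, the split-then-trim idea from the last timing can also be written in base R alone. The sketch below calls strsplit once on the whole vector instead of once per element, which is the main difference from the sapply version in the question. The helper name pick is made up for illustration, and I have not benchmarked this variant, so I make no claim about how it compares with the timings above.

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

# One vectorised split on the bare "/" covers both " / " and "/".
parts <- strsplit(x, "/", fixed = TRUE)

# pick() is a hypothetical helper: take the i-th piece of every element
# (NA when that piece does not exist) and trim the surrounding whitespace.
pick <- function(i) trimws(vapply(parts, `[`, character(1), i))

continent <- pick(1)
country   <- pick(2)
city      <- pick(3)   # NA where the city was missing
```

Indexing past the end of a character vector returns NA, which is what turns a missing city into NA here.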
Data used for the example output above:
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")