I would like to separate a column of strings such as [1, 58, 10] into columns using separate from tidyr. My problem is that sometimes the columns are shorter (never longer). I have many columns with this issue in the same data frame.
Loading packages
require(tidyr)
require(dplyr)
require(stringr)
The data
Here I make a data frame with samples from the real data. The "vectors" are of length 10 in col1 and 9 or 10 in col2. There is a time column just to show that there are other columns as well.
df <- data.frame(
time = as.POSIXct(1:5, origin=Sys.time()),
col1 = c("[0,355,0,0,0,1227,0,0,382059,116]", "[0,31,0,0,0,5,0,0,925,1]", "[0,1,0,0,0,471,0,0,130339,3946]", "[0,0,0,0,0,223,0,0,37666,12]", "[0,19,0,0,0,667,0,0,336956,53]"),
col2 = c("[0,355,0,0,0,1227,0,0,382059,116]", "[0,355,0,0,0,1227,0,0,382059,116]", "[0,0,0,0,0,223,0,0,37666,12]", "[0,19,0,0,0,667,0,0,336956]","[0,355,0,0,0,1227,0,0,382059,116]")
)
How I want it to be
For the first column where all "vectors" are of equal length I can use separate() to get what I want.
a1 <- df %>%
mutate(col1 = str_sub(col1,2,-2)) %>%
separate(col1, paste("col1",1:10,sep="."),",")
# Making sure the numbers are numeric
a1 <- as.data.frame(sapply(a1, as.numeric)) %>%
mutate(time = as.POSIXct(time, origin="1970-01-01")) %>% select(-col2)
This results in
> a1
time col1.1 col1.2 col1.3 col1.4 col1.5 col1.6 col1.7 col1.8
1 2014-11-07 12:21:45 0 355 0 0 0 1227 0 0
2 2014-11-07 12:21:46 0 31 0 0 0 5 0 0
3 2014-11-07 12:21:47 0 1 0 0 0 471 0 0
4 2014-11-07 12:21:48 0 0 0 0 0 223 0 0
5 2014-11-07 12:21:49 0 19 0 0 0 667 0 0
col1.9 col1.10
1 382059 116
2 925 1
3 130339 3946
4 37666 12
5 336956 53
This does not work for col2 where the elements can't be split into several columns
Workaround
# Does not work
#b1 <- df %>%
# mutate(col2 = str_sub(col1,2,-2)) %>%
# separate(col2, paste("col2",1:10,sep="."),",")
b2 <- sapply(as.data.frame(str_split_fixed(str_sub(df$col2,2,-2),',',n=10), stringsAsFactors=F), as.numeric)
colnames(b2) <- paste("col2",1:10,sep=".")
b2 <- as.data.frame(cbind(time=df$time, b2)) %>%
mutate(time = as.POSIXct(time, origin="1970-01-01"))
Which results in
> b2
time col2.1 col2.2 col2.3 col2.4 col2.5 col2.6 col2.7 col2.8
1 2014-11-07 12:21:45 0 355 0 0 0 1227 0 0
2 2014-11-07 12:21:46 0 355 0 0 0 1227 0 0
3 2014-11-07 12:21:47 0 0 0 0 0 223 0 0
4 2014-11-07 12:21:48 0 19 0 0 0 667 0 0
5 2014-11-07 12:21:49 0 355 0 0 0 1227 0 0
col2.9 col2.10
1 382059 116
2 382059 116
3 37666 12
4 336956 NA
5 382059 116
If the vector is shorter, the last elements shall be NA, so this is correct.
The questions
Is there a way to use separate (or some other simpler function) instead of the workaround? Is there a way to apply this to col1 and col2 at the same time (by selecting columns that starts with col for example)?
Thanks!
This only answers the first part of your question about separate
. There is an extra
argument in separate
(at least in the development version of tidyr) that will allow you to do what you want if you set extra
to "merge"
.
df %>%
mutate(col2 = str_sub(col2,2,-2)) %>%
separate(col2, paste("col2",1:10,sep="."), ",", extra = "merge")
time col1
1 2014-11-07 08:00:59 [0,355,0,0,0,1227,0,0,382059,116]
2 2014-11-07 08:01:00 [0,31,0,0,0,5,0,0,925,1]
3 2014-11-07 08:01:01 [0,1,0,0,0,471,0,0,130339,3946]
4 2014-11-07 08:01:02 [0,0,0,0,0,223,0,0,37666,12]
5 2014-11-07 08:01:03 [0,19,0,0,0,667,0,0,336956,53]
col2.1 col2.2 col2.3 col2.4 col2.5 col2.6 col2.7 col2.8
1 0 355 0 0 0 1227 0 0
2 0 355 0 0 0 1227 0 0
3 0 0 0 0 0 223 0 0
4 0 19 0 0 0 667 0 0
5 0 355 0 0 0 1227 0 0
col2.9 col2.10
1 382059 116
2 382059 116
3 37666 12
4 336956 <NA>
5 382059 116
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With