I have a data table tmp, which can look like this (just a short example):
dput(tmp)
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32),
`2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
c(46.8, 41.87), c(0, 0), c(0, 0), c(0, 0), c(10.03, 10.04
)), `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85,
0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0,
10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99),
`2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA,
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)
Here we can see that the third column ("2020-03-29-03") has vector entries. What I want is to take the second entry of each of these vectors as a single numeric entry. The vector column (here: the third column) isn't always at the same column index. So, first we need to find the columns whose entries are vectors, and then take only the second entry of each vector.
In the end my data table should look like this:
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32),
`2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
c(41.87), 0, 0, 0, c(10.04)),
`2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85,
0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0,
10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99),
`2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA,
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)
If you inspect tmp using str(tmp) or lapply(tmp, class), you will notice that all columns are list columns, even those where the vectors contain only one element.
This can also be revealed by setting the appropriate print option:
library(data.table)
options(datatable.print.class = TRUE)
tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
          <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>
1:         42.51          46.8   46.80,41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:             0             0           0,0             0             0             0             0             0             0
3:             0             0           0,0             0             0             0             0             0             0
4:             0             0           0,0             0             0             0             0             0             0
5:         12.32         10.03   10.03,10.04          9.24          9.06          9.19         10.39          9.99         11.24
So, if all list columns need to be coerced to numeric, we can pick the last value in each vector (which happens to be the second entry of the vectors in column 3) by using the last() function:
tmp[, lapply(.SD, \(x) sapply(x, last)), .SDcols = is.list]
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
           <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>
1:         42.51         46.80         41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
3:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
4:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
5:         12.32         10.03         10.04          9.24          9.06          9.19         10.39          9.99         11.24
Now, all columns are numeric.
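Since the dput() output in the question cannot be pasted back into R as-is (it contains a pointer), here is a minimal sketch of the same idea on a small made-up table. The name toy and its values are illustrative assumptions, not the asker's data, and function(x) is used instead of \(x) so that it also runs on R versions before 4.1.0.

```r
library(data.table)

# Toy table (illustrative values): every column is a list column,
# and column "b" holds length-2 vectors
toy <- data.table(
  a = list(42.51, 0, 12.32),
  b = list(c(46.8, 41.87), c(0, 0), c(10.03, 10.04)),
  c = list(45.63, 0, 9.24)
)

# Take the last element of every cell; sapply() simplifies each
# list column to a plain numeric vector
res <- toy[, lapply(.SD, function(x) sapply(x, last))]

sapply(res, class)  # every column is now "numeric"
res$b               # values: 41.87, 0, 10.04
```

Note that last() only matches "the second entry" as long as no vector has more than two elements.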
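A quick and dirty method (dt here stands for the data table, called tmp in the question):
as.data.table(lapply(dt, \(x) {
  # every cell holds exactly one element: keep the column unchanged
  if (length(x) == sum(lengths(x)))
    x
  else
    # otherwise take the second element of each cell
    sapply(x, \(y) y[[2]])
}))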
Alternatively, using the modify-in-place aspect of data.table:
for (i in names(dt)[sapply(dt, \(x) sum(lengths(x)) != length(x))]) {
  set(dt, j = i, value = sapply(dt[[i]], \(y) y[[2]]))
}
Note that I use the new lambda function syntax introduced in R 4.1.0. Before that, you would have to use function(x) and function(y) in place of \(x) and \(y).
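For readers on an R version older than 4.1.0, the in-place variant might be written as in the following sketch. The table dt and its values are made up for illustration and only stand in for the real data.

```r
library(data.table)

# Hypothetical toy table standing in for dt
dt <- data.table(
  a = list(42.51, 0, 12.32),
  b = list(c(46.8, 41.87), c(0, 0), c(10.03, 10.04))
)

# Names of columns in which at least one cell holds more than one element
vec_cols <- names(dt)[sapply(dt, function(x) sum(lengths(x)) != length(x))]

# Replace each such column in place with the second element of every cell
for (i in vec_cols) {
  set(dt, j = i, value = sapply(dt[[i]], function(y) y[[2]]))
}
```

Be aware that y[[2]] raises an error if a cell in a selected column has only one element, so this assumes every cell in those columns has at least two.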
Try apply in a loop over the columns:
for (col in colnames(tmp)) {
  tmp[, col] <- apply(tmp[, ..col], 1, function(x) {
    # mean(unlist(x), na.rm = TRUE) ## if you want the mean instead of the second entry of this vector
    ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
  })
}
Or just apply over the whole table:
library(magrittr) # provides the %>% pipe
tmp <- apply(tmp, c(1, 2), function(x) {
  # mean(unlist(x), na.rm = TRUE)
  ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
}) %>% as.data.table() ## convert from matrix back to data.table
This solution should be robust enough to handle your problem.
It automatically checks which columns need to be cleaned. If you want to specify certain columns yourself, just change cols_contain_vec to a vector of column indices.
# Find the relevant cols which contain vectors:
# which cols have a maximum cell length over 1?
cols_contain_vec <- which(apply(tmp, MARGIN = 2, function(x) max(lengths(x))) > 1)
tmp[, cols_contain_vec] <- apply(
  tmp[, cols_contain_vec, with = FALSE],
  # separate function call for every row (1) and column (2)
  MARGIN = c(1, 2),
  function(x) {
    # Return the second entry if possible. The cells are stored
    # as lists, so we have to unlist them first
    relevant_vec <- unlist(x)
    if (length(relevant_vec) > 1) {
      # if vector length is over 1, return the second element
      return(relevant_vec[[2]])
    } else {
      # if vector length is below 2, return the first value
      return(relevant_vec[[1]])
    }
  }
)
This results in the following:
> tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06
1:         42.51          46.8         41.87         45.63         40.86         45.85
2:             0             0          0.00             0             0             0
3:             0             0          0.00             0             0             0
4:             0             0          0.00             0             0             0
5:         12.32         10.03         10.04          9.24          9.06          9.19
   2020-03-29-07 2020-03-29-08 2020-03-29-09
1:         43.68         47.14         49.06
2:             0             0             0
3:             0             0             0
4:             0             0             0
5:         10.39          9.99         11.24
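The same "second entry if present, otherwise the first" rule can also be sketched without converting to a matrix via apply(). This is an alternative under assumptions rather than the code above: the helper second_or_first and the table toy are hypothetical names, and the values are made up.

```r
library(data.table)

# Toy table (illustrative values); column "b" mixes cell lengths
toy <- data.table(
  a = list(42.51, 0, 12.32),
  b = list(c(46.8, 41.87), 0, c(10.03, 10.04))
)

# Hypothetical helper: second element if present, otherwise the first
second_or_first <- function(v) if (length(v) > 1) v[[2]] else v[[1]]

# Columns in which at least one cell holds more than one element
cols <- names(toy)[vapply(toy, function(x) max(lengths(x)) > 1, logical(1))]

# Replace those columns in place with plain numeric vectors
for (col in cols) {
  set(toy, j = col, value = vapply(toy[[col]], second_or_first, numeric(1)))
}
```

Using vapply() with numeric(1) makes the loop fail loudly if a cell turns out not to be numeric, instead of silently returning a list.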
I hope this helps.