
Data table has vector as an entry - how to find out in which column and then only take the second entry of vector as a single integer

I have a data table tmp, which can look like this (just a short example):

dput(tmp)
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32), 
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
        c(46.8, 41.87), c(0, 0), c(0, 0), c(0, 0), c(10.03, 10.04
        )), `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
        40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85, 
        0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0, 
        10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99), 
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)

Here we can see that the third column ("2020-03-29-03") has vector entries. I want to keep only the second entry of each vector as a single number. The vector column (here the third) is not always at the same column index, so we first need to find the column whose entries are vectors and then take only the second entry of each vector.

In the end my data table should look like this:

structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32), 
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
        c(41.87), 0, 0, 0, c(10.04)), 
    `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
        40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85, 
        0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0, 
        10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99), 
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)

MikiK asked Aug 11 '21 08:08


4 Answers

If you inspect tmp using str(tmp) or lapply(tmp, class) you will notice that all columns are list columns, even those where the vectors contain only one element.

This also becomes visible when the appropriate print option is set:

library(data.table)
options(datatable.print.class = TRUE)
tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
          <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>
1:         42.51          46.8   46.80,41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:             0             0           0,0             0             0             0             0             0             0
3:             0             0           0,0             0             0             0             0             0             0
4:             0             0           0,0             0             0             0             0             0             0
5:         12.32         10.03   10.03,10.04          9.24          9.06          9.19         10.39          9.99         11.24

So, if all list columns need to be coerced to numeric, we can pick the last value in each vector (which happens to be the second entry for the vectors in column 3) by using the last() function:

tmp[, lapply(.SD, \(x) sapply(x, last)), .SDcols = is.list]
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
           <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>
1:         42.51         46.80         41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
3:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
4:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
5:         12.32         10.03         10.04          9.24          9.06          9.19         10.39          9.99         11.24

Now, all columns are numeric.
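
Since last() matches the second entry only when a vector has exactly two values, here is a minimal sketch of a variant that takes the second entry explicitly. The table dt, its column names, and the helper pick_second are hypothetical stand-ins for illustration and are not part of the original answer:

```r
library(data.table)

# Hypothetical stand-in for tmp: three list columns, one with two-element cells
dt <- data.table(
  a = list(42.51, 0, 12.32),
  b = list(c(46.8, 41.87), c(0, 0), c(10.03, 10.04)),
  c = list(45.63, 0, 9.24)
)

# Names of the columns in which at least one cell holds more than one value
vec_cols <- names(dt)[sapply(dt, function(col) any(lengths(col) > 1L))]

# Hypothetical helper: second value if the cell has two or more, else the first
pick_second <- function(cell) if (length(cell) >= 2L) cell[[2L]] else cell[[1L]]

# Replace every list column by a plain numeric column, by reference
for (col in names(dt)) {
  set(dt, j = col, value = vapply(dt[[col]], pick_second, numeric(1)))
}
```

Here vec_cols answers the "which column" part of the question, and vapply() with numeric(1) would stop with an error if a cell did not yield exactly one number.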

Uwe answered Oct 25 '22 20:10


A quick and dirty method (dt stands for the data table, i.e. tmp in the question):

as.data.table(lapply(dt, \(x){
  if(length(x) == sum(lengths(x)))
    x
  else
    sapply(x, \(y)y[[2]])
}))

Alternatively, using the update-by-reference aspect of data.table:

for(i in names(dt)[sapply(dt, \(x)sum(lengths(x)) != length(x))]){
  set(dt, j = i, value = sapply(dt[[i]], \(y)y[[2]]))
}

Note that I use the new lambda function syntax introduced in R 4.1.0. In earlier versions you would have to use function(x) and function(y) in place of \(x) and \(y).
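
To illustrate the length check used above, here is a small sketch on two hypothetical list columns (scalar_col, vector_col, and all_scalar are names made up for this example). It assumes that no cell is empty, because an empty cell next to a two-element cell would cancel out in the sum:

```r
# Hypothetical list columns: one with scalar cells, one with two-element cells
scalar_col <- list(42.51, 0, 12.32)
vector_col <- list(c(46.8, 41.87), c(0, 0), c(10.03, 10.04))

# TRUE when the number of cells equals the total number of values,
# i.e. every cell holds exactly one value (assuming no empty cells)
all_scalar <- function(x) length(x) == sum(lengths(x))

all_scalar(scalar_col)
all_scalar(vector_col)

# The same extraction written with function(y), which also runs before R 4.1.0
second_values <- sapply(vector_col, function(y) y[[2]])
```

Note that y[[2]] raises a "subscript out of bounds" error for a cell with only one value, so this extraction relies on every cell in a flagged column having at least two values.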

Oliver answered Oct 25 '22 20:10


Try apply in a loop over the columns:

for (col in colnames(tmp)) {
  tmp[,col] <- apply(tmp[,..col], 1, function(x) {
    # mean(unlist(x), na.rm = TRUE) ## if you want the mean instead of the second entry of this vector
    ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
    }  
  )  
}

Or just apply (the %>% pipe requires magrittr to be loaded):

library(magrittr) # provides %>%
tmp <- apply(tmp, c(1,2), function(x) {
  # mean(unlist(x), na.rm = TRUE)
  ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
  } 
) %>% as.data.table() ## convert to data.table from matrix
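
The is.na() test above works because single-bracket indexing past the end of a vector returns NA rather than an error. A minimal sketch of that fallback, with second_or_first as a hypothetical helper name:

```r
# Single-bracket indexing past the end of a vector yields NA instead of an error
one_value  <- c(42.51)
two_values <- c(46.8, 41.87)

# Hypothetical helper mirroring the fallback logic above:
# use the second value if it exists, otherwise the first
second_or_first <- function(v) if (is.na(v[2])) v[1] else v[2]

second_or_first(one_value)
second_or_first(two_values)
```

One consequence of this design is that a genuine NA in the second position also triggers the fallback to the first value.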

dy_by answered Oct 25 '22 20:10


This solution should be robust enough to handle your problem. It automatically checks which columns need to be cleaned. If you want to specify certain columns yourself, just change cols_contain_vec to a vector of column indices.

# Find the relevant cols which contain vectors
# which cols contain max lengths over 1?
cols_contain_vec <- which(apply(tmp, MARGIN = 2,function(x) max(lengths(x))) > 1)



tmp[,cols_contain_vec] <- apply(
  tmp[,cols_contain_vec, with = FALSE],
  # separate function call for every row (1) and column(2)
  MARGIN = c(1,2),
  function(x) { # Return second entry if possible, for some reason the vectors are saved
                # as lists, so we have to unlist them
    relevant_vec <- unlist(x)
    if(length(relevant_vec)>1){
      # if vector length over 1, return second element
      return(relevant_vec[[2]])
    } else {
      # if vector length is below 2 then return the first value
      return(relevant_vec[[1]])
    }
  }
)

This results in the following:

> tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06
1:         42.51          46.8         41.87         45.63         40.86         45.85
2:             0             0          0.00             0             0             0
3:             0             0          0.00             0             0             0
4:             0             0          0.00             0             0             0
5:         12.32         10.03         10.04          9.24          9.06          9.19
   2020-03-29-07 2020-03-29-08 2020-03-29-09
1:         43.68         47.14         49.06
2:             0             0             0
3:             0             0             0
4:             0             0             0
5:         10.39          9.99         11.24

I hope this helps.

Sandwichnick answered Oct 25 '22 22:10