Data table has vector as an entry - how to find out in which column and then only take the second entry of vector as a single integer

I have a data table tmp, which can look like this (just a short example):

dput(tmp)
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32), 
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
        c(46.8, 41.87), c(0, 0), c(0, 0), c(0, 0), c(10.03, 10.04
        )), `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
        40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85, 
        0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0, 
        10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99), 
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)

As you can see, the third column ("2020-03-29-03") has vector entries. I want to replace each of these vectors with its second entry, as a single number. The vector column (here the third) is not always at the same column index, so we first need to find out which column holds vectors and then keep only the second entry of each vector.

In the end my data table should look like this:

structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32), 
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
        c(41.87), 0, 0, 0, c(10.04)), 
    `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
        40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85, 
        0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0, 
        10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99), 
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)

asked Aug 11 '21 08:08

MikiK

4 Answers

If you inspect tmp using str(tmp) or lapply(tmp, class) you will notice that all columns are list columns, even those where the vectors contain only one element.
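A minimal sketch of that inspection, using a small stand-in table rather than the asker's data (the table demo and its column names are made up for illustration). lengths() reports how many values each cell holds, so any column whose maximum exceeds 1 contains vector entries:

```r
library(data.table)

# Hypothetical stand-in for tmp: every column is a list column,
# but only column "h03" holds vectors of length 2.
demo <- data.table(
  h00 = list(42.51, 0, 12.32),
  h03 = list(c(46.8, 41.87), c(0, 0), c(10.03, 10.04)),
  h04 = list(45.63, 0, 9.24)
)

# Every column reports class "list"
col_classes <- sapply(demo, class)

# Longest cell per column; values above 1 flag vector entries
max_len <- sapply(demo, function(col) max(lengths(col)))

# Names of the columns that contain vectors
vec_cols <- names(max_len)[max_len > 1]
print(vec_cols)
```

Working with column names instead of positions sidesteps the problem that the vector column is not always at the same index.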

This also becomes visible when you set the appropriate print option:

library(data.table)
options(datatable.print.class = TRUE)
tmp

   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
          <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>
1:         42.51          46.8   46.80,41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:             0             0           0,0             0             0             0             0             0             0
3:             0             0           0,0             0             0             0             0             0             0
4:             0             0           0,0             0             0             0             0             0             0
5:         12.32         10.03   10.03,10.04          9.24          9.06          9.19         10.39          9.99         11.24

So, if all list columns are to be coerced to numeric, we can pick the last value of each vector (which for column 3 happens to be the second entry) using the last() function:

tmp[, lapply(.SD, \(x) sapply(x, last)), .SDcols = is.list]

   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
           <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>
1:         42.51         46.80         41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
3:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
4:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
5:         12.32         10.03         10.04          9.24          9.06          9.19         10.39          9.99         11.24

Now, all columns are numeric.
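Note that "last" and "second" only coincide because the vectors here have exactly two elements. A minimal sketch of picking the second entry explicitly, falling back to the only value for length-one cells, might look like this (the helper second_or_only and the table demo are hypothetical names used for illustration):

```r
library(data.table)

# Hypothetical data: the vector column holds three values per cell,
# so "last" and "second" are no longer the same thing.
demo <- data.table(
  h00 = list(42.51, 0),
  h03 = list(c(46.8, 41.87, 40.1), c(0, 0, 0))
)

# Hypothetical helper: second entry if present, otherwise the only entry
second_or_only <- function(v) if (length(v) >= 2L) v[[2L]] else v[[1L]]

# vapply() enforces exactly one numeric value per cell
res <- demo[, lapply(.SD, function(col) vapply(col, second_or_only, numeric(1)))]

second_vals <- res$h03                               # second entries
last_vals   <- sapply(demo$h03, data.table::last)    # last entries, for comparison
print(second_vals)
print(last_vals)
```

Using vapply() instead of sapply() makes the conversion fail loudly if a cell ever yields something other than a single number.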

answered Oct 25 '22 20:10

Uwe

A quick and dirty method:

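as.data.table(lapply(tmp, \(x) {
  # a column without vector entries has exactly one value per cell
  if (length(x) == sum(lengths(x)))
    x
  else
    sapply(x, \(y) y[[2]])  # keep the second entry of each vector
}))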

Alternatively, using data.table's ability to modify a table in place:

# loop over the columns that contain vector entries and overwrite them by reference
for (i in names(tmp)[sapply(tmp, \(x) sum(lengths(x)) != length(x))]) {
  set(tmp, j = i, value = sapply(tmp[[i]], \(y) y[[2]]))
}

Note that I use the new lambda syntax introduced in R 4.1.0. In earlier versions you have to write function(x) and function(y) in place of \(x) and \(y).
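To illustrate what "in place" means for the set() variant, here is a minimal sketch on a made-up one-column table (demo and alias are hypothetical names). Plain assignment of a data.table does not copy it, so a change made through set() is visible under both names:

```r
library(data.table)

# Hypothetical one-column table with vector entries
demo <- data.table(h03 = list(c(46.8, 41.87), c(0, 0)))
alias <- demo   # no copy: both names refer to the same table

# Replace the list column with a plain numeric column, by reference
set(demo, j = "h03", value = sapply(demo[["h03"]], function(y) y[[2]]))

# The change is visible through the second name as well
print(class(alias$h03))
print(alias$h03)
```

If the original table has to be preserved, take an explicit copy(demo) before modifying it.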

answered Oct 25 '22 20:10

Oliver

Try apply() in a loop over the columns:

for (col in colnames(tmp)) {
  tmp[,col] <- apply(tmp[,..col], 1, function(x) {
    # mean(unlist(x), na.rm = TRUE) ## if you want the mean instead of the second entry of the vector
    # take the second entry if there is one, otherwise the first
    ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
  })
}

Or just apply() over all cells at once:

library(magrittr)  # provides %>%

tmp <- apply(tmp, c(1,2), function(x) {
  # mean(unlist(x), na.rm = TRUE)
  ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
}) %>% as.data.table() ## convert the resulting matrix back to a data.table
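Both variants rely on the fact that indexing past the end of an atomic vector returns NA instead of raising an error. A minimal sketch of that behaviour on made-up values, together with a length-based alternative (pick_second is a hypothetical helper, not part of the answer above):

```r
# Made-up cell contents: one single value, one pair
single <- c(9.24)
pair   <- c(10.03, 10.04)

# Out-of-range indexing gives NA, which is what the is.na() test detects
single_missing <- is.na(single[2])   # TRUE: fall back to the first entry
pair_missing   <- is.na(pair[2])     # FALSE: use the second entry

# Hypothetical helper with the same intent, testing length() instead of is.na()
pick_second <- function(v) if (length(v) >= 2L) v[2] else v[1]
from_single <- pick_second(single)
from_pair   <- pick_second(pair)

# Caveat: a genuine NA in position 2 looks like "no second entry" to is.na()
with_na  <- c(5, NA)
via_isna <- ifelse(is.na(with_na[2]), with_na[1], with_na[2])  # 5
via_len  <- pick_second(with_na)                               # NA
```

If the data can contain real missing values in the second position, the length-based check keeps them, while the is.na() check silently substitutes the first entry.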

answered Oct 25 '22 20:10

dy_by

This solution should be robust enough to handle your problem. It automatically checks which columns need to be cleaned. If you want to specify certain columns yourself, just set cols_contain_vec to a vector of column indices.

# Find the relevant cols which contain vectors
# which cols contain max lengths over 1?
cols_contain_vec <- which(apply(tmp, MARGIN = 2,function(x) max(lengths(x))) > 1)



tmp[,cols_contain_vec] <- apply(
  tmp[,cols_contain_vec, with = FALSE],
  # separate function call for every cell: rows (1) and columns (2)
  MARGIN = c(1,2),
  function(x) { # Return the second entry if possible. For some reason the vectors
                # are saved as lists, so we have to unlist them
    relevant_vec <- unlist(x)
    if(length(relevant_vec)>1){
      # if the vector has more than one element, return the second element
      return(relevant_vec[[2]])
    } else {
      # if the vector has fewer than two elements, return the first value
      return(relevant_vec[[1]])
    }
  })

This results in the following:

> tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06
1:         42.51          46.8         41.87         45.63         40.86         45.85
2:             0             0          0.00             0             0             0
3:             0             0          0.00             0             0             0
4:             0             0          0.00             0             0             0
5:         12.32         10.03         10.04          9.24          9.06          9.19
   2020-03-29-07 2020-03-29-08 2020-03-29-09
1:         43.68         47.14         49.06
2:             0             0             0
3:             0             0             0
4:             0             0             0
5:         10.39          9.99         11.24

I hope this helps.

answered Oct 25 '22 22:10

Sandwichnick

Tags: r, data.table, vector
