Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.frame rows to a list

Tags:

list

dataframe

r

People also ask

How do I turn a DataFrame row into a list?

The command to convert Dataframe to list is pd. DataFrame. values. tolist().

How do I convert a DataFrame row to a list in R?

We can convert a given data frame to a list by row we use the split() function. The split() function is used to split the data according to our requirement and the arguments given in the split() function are data itself and the other one is seq() function.

Can you put Dataframes in a list Python?

Pandas DataFrame can be converted into lists in multiple ways. Let's have a look at different ways of converting a DataFrame one by one. Method #1: Converting a DataFrame to List containing all the rows of a particular column: Python3.

How do you convert a column in a DataFrame to a list?

Use the tolist() Method to Convert a Dataframe Column to a List. A column in the Pandas dataframe is a Pandas Series . So if we need to convert a column to a list, we can use the tolist() method in the Series . tolist() converts the Series of pandas data-frame to a list.


Like this:

xy.list <- split(xy.df, seq(nrow(xy.df)))

And if you want the rownames of xy.df to be the names of the output list, you can do:

xy.list <- setNames(split(xy.df, seq(nrow(xy.df))), rownames(xy.df))

Eureka!

xy.list <- as.list(as.data.frame(t(xy.df)))

If you want to completely abuse the data.frame (as I do) and like to keep the $ functionality, one way is to split you data.frame into one-line data.frames gathered in a list :

> df = data.frame(x=c('a','b','c'), y=3:1)
> df
  x y
1 a 3
2 b 2
3 c 1

# 'convert' into a list of data.frames
ldf = lapply(as.list(1:dim(df)[1]), function(x) df[x[1],])

> ldf
[[1]]
x y
1 a 3    
[[2]]
x y
2 b 2
[[3]]
x y
3 c 1

# and the 'coolest'
> ldf[[2]]$y
[1] 2

It is not only intellectual masturbation, but allows to 'transform' the data.frame into a list of its lines, keeping the $ indexation which can be useful for further use with lapply (assuming the function you pass to lapply uses this $ indexation)


A more modern solution uses only purrr::transpose:

library(purrr)
iris[1:2,] %>% purrr::transpose()
#> [[1]]
#> [[1]]$Sepal.Length
#> [1] 5.1
#> 
#> [[1]]$Sepal.Width
#> [1] 3.5
#> 
#> [[1]]$Petal.Length
#> [1] 1.4
#> 
#> [[1]]$Petal.Width
#> [1] 0.2
#> 
#> [[1]]$Species
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$Sepal.Length
#> [1] 4.9
#> 
#> [[2]]$Sepal.Width
#> [1] 3
#> 
#> [[2]]$Petal.Length
#> [1] 1.4
#> 
#> [[2]]$Petal.Width
#> [1] 0.2
#> 
#> [[2]]$Species
#> [1] 1

I was working on this today for a data.frame (really a data.table) with millions of observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.

Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat))) for that data set. Below, I benchmark the three methods on a 7500 row, 5 column data set (iris repeated 50 times).

library(data.table)
library(microbenchmark)

microbenchmark(
split={dat1 <- split(dat, seq_len(nrow(dat)))},
setDF={dat2 <- lapply(seq_len(nrow(dat)),
                  function(i) setDF(lapply(dat, "[", i)))},
attrDT={dat3 <- lapply(seq_len(nrow(dat)),
           function(i) {
             tmp <- lapply(dat, "[", i)
             attr(tmp, "class") <- c("data.table", "data.frame")
             setDF(tmp)
           })},
datList = {datL <- lapply(seq_len(nrow(dat)),
                          function(i) lapply(dat, "[", i))},
times=20
) 

This returns

Unit: milliseconds
       expr      min       lq     mean   median        uq       max neval
      split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150    20
      setDF 459.0577 466.3432 511.2656 482.1943  500.6958  750.6635    20
     attrDT 399.1999 409.6316 461.6454 422.5436  490.5620  717.6355    20
    datList 192.1175 201.9896 241.4726 208.4535  246.4299  411.2097    20

While the differences are not as large as in my previous test, the straight setDF method is significantly faster at all levels of the distribution of runs with max(setDF) < min(split) and the attr method is typically more than twice as fast.

A fourth method is the extreme champion, which is a simple nested lapply, returning a nested list. This method exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame function were roughly an order of magnitude slower than the data.table techniques.

data

dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))

A couple of more options :

With asplit

asplit(xy.df, 1)
#[[1]]
#     x      y 
#0.1137 0.6936 

#[[2]]
#     x      y 
#0.6223 0.5450 

#[[3]]
#     x      y 
#0.6093 0.2827 
#....

With split and row

split(xy.df, row(xy.df)[, 1])

#$`1`
#       x      y
#1 0.1137 0.6936

#$`2`
#       x     y
#2 0.6223 0.545

#$`3`
#       x      y
#3 0.6093 0.2827
#....

data

set.seed(1234)
xy.df <- data.frame(x = runif(10),  y = runif(10))

Seems a current version of the purrr (0.2.2) package is the fastest solution:

by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out

Let's compare the most interesting solutions:

data("Batting", package = "Lahman")
x <- Batting[1:10000, 1:10]
library(benchr)
library(purrr)
benchmark(
    split = split(x, seq_len(.row_names_info(x, 2L))),
    mapply = .mapply(function(...) structure(list(...), class = "data.frame", row.names = 1L), x, NULL),
    purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
)

Rsults:

Benchmark summary:
Time units : milliseconds 
  expr n.eval   min  lw.qu median   mean  up.qu  max  total relative
 split    100 983.0 1060.0 1130.0 1130.0 1180.0 1450 113000     34.3
mapply    100 826.0  894.0  963.0  972.0 1030.0 1320  97200     29.3
 purrr    100  24.1   28.6   32.9   44.9   40.5  183   4490      1.0

Also we can get the same result with Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List df2list(const DataFrame& x) {
    std::size_t nrows = x.rows();
    std::size_t ncols = x.cols();
    CharacterVector nms = x.names();
    List res(no_init(nrows));
    for (std::size_t i = 0; i < nrows; ++i) {
        List tmp(no_init(ncols));
        for (std::size_t j = 0; j < ncols; ++j) {
            switch(TYPEOF(x[j])) {
                case INTSXP: {
                    if (Rf_isFactor(x[j])) {
                        IntegerVector t = as<IntegerVector>(x[j]);
                        RObject t2 = wrap(t[i]);
                        t2.attr("class") = "factor";
                        t2.attr("levels") = t.attr("levels");
                        tmp[j] = t2;
                    } else {
                        tmp[j] = as<IntegerVector>(x[j])[i];
                    }
                    break;
                }
                case LGLSXP: {
                    tmp[j] = as<LogicalVector>(x[j])[i];
                    break;
                }
                case CPLXSXP: {
                    tmp[j] = as<ComplexVector>(x[j])[i];
                    break;
                }
                case REALSXP: {
                    tmp[j] = as<NumericVector>(x[j])[i];
                    break;
                }
                case STRSXP: {
                    tmp[j] = as<std::string>(as<CharacterVector>(x[j])[i]);
                    break;
                }
                default: stop("Unsupported type '%s'.", type2name(x));
            }
        }
        tmp.attr("class") = "data.frame";
        tmp.attr("row.names") = 1;
        tmp.attr("names") = nms;
        res[i] = tmp;
    }
    res.attr("names") = x.attr("row.names");
    return res;
}

Now caompare with purrr:

benchmark(
    purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out,
    rcpp = df2list(x)
)

Results:

Benchmark summary:
Time units : milliseconds 
 expr n.eval  min lw.qu median mean up.qu   max total relative
purrr    100 25.2  29.8   37.5 43.4  44.2 159.0  4340      1.1
 rcpp    100 19.0  27.9   34.3 35.8  37.2  93.8  3580      1.0