The command to convert Dataframe to list is pd. DataFrame. values. tolist().
We can convert a given data frame to a list by row we use the split() function. The split() function is used to split the data according to our requirement and the arguments given in the split() function are data itself and the other one is seq() function.
Pandas DataFrame can be converted into lists in multiple ways. Let's have a look at different ways of converting a DataFrame one by one. Method #1: Converting a DataFrame to List containing all the rows of a particular column: Python3.
Use the tolist() Method to Convert a Dataframe Column to a List. A column in the Pandas dataframe is a Pandas Series . So if we need to convert a column to a list, we can use the tolist() method in the Series . tolist() converts the Series of pandas data-frame to a list.
Like this:
xy.list <- split(xy.df, seq(nrow(xy.df)))
And if you want the rownames of xy.df
to be the names of the output list, you can do:
xy.list <- setNames(split(xy.df, seq(nrow(xy.df))), rownames(xy.df))
Eureka!
xy.list <- as.list(as.data.frame(t(xy.df)))
If you want to completely abuse the data.frame (as I do) and like to keep the $ functionality, one way is to split you data.frame into one-line data.frames gathered in a list :
> df = data.frame(x=c('a','b','c'), y=3:1)
> df
x y
1 a 3
2 b 2
3 c 1
# 'convert' into a list of data.frames
ldf = lapply(as.list(1:dim(df)[1]), function(x) df[x[1],])
> ldf
[[1]]
x y
1 a 3
[[2]]
x y
2 b 2
[[3]]
x y
3 c 1
# and the 'coolest'
> ldf[[2]]$y
[1] 2
It is not only intellectual masturbation, but allows to 'transform' the data.frame into a list of its lines, keeping the $ indexation which can be useful for further use with lapply (assuming the function you pass to lapply uses this $ indexation)
A more modern solution uses only purrr::transpose
:
library(purrr)
iris[1:2,] %>% purrr::transpose()
#> [[1]]
#> [[1]]$Sepal.Length
#> [1] 5.1
#>
#> [[1]]$Sepal.Width
#> [1] 3.5
#>
#> [[1]]$Petal.Length
#> [1] 1.4
#>
#> [[1]]$Petal.Width
#> [1] 0.2
#>
#> [[1]]$Species
#> [1] 1
#>
#>
#> [[2]]
#> [[2]]$Sepal.Length
#> [1] 4.9
#>
#> [[2]]$Sepal.Width
#> [1] 3
#>
#> [[2]]$Petal.Length
#> [1] 1.4
#>
#> [[2]]$Petal.Width
#> [1] 0.2
#>
#> [[2]]$Species
#> [1] 1
I was working on this today for a data.frame (really a data.table) with millions of observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.
Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat)))
for that data set. Below, I benchmark the three methods on a 7500 row, 5 column data set (iris repeated 50 times).
library(data.table)
library(microbenchmark)
microbenchmark(
split={dat1 <- split(dat, seq_len(nrow(dat)))},
setDF={dat2 <- lapply(seq_len(nrow(dat)),
function(i) setDF(lapply(dat, "[", i)))},
attrDT={dat3 <- lapply(seq_len(nrow(dat)),
function(i) {
tmp <- lapply(dat, "[", i)
attr(tmp, "class") <- c("data.table", "data.frame")
setDF(tmp)
})},
datList = {datL <- lapply(seq_len(nrow(dat)),
function(i) lapply(dat, "[", i))},
times=20
)
This returns
Unit: milliseconds
expr min lq mean median uq max neval
split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150 20
setDF 459.0577 466.3432 511.2656 482.1943 500.6958 750.6635 20
attrDT 399.1999 409.6316 461.6454 422.5436 490.5620 717.6355 20
datList 192.1175 201.9896 241.4726 208.4535 246.4299 411.2097 20
While the differences are not as large as in my previous test, the straight setDF
method is significantly faster at all levels of the distribution of runs with max(setDF) < min(split) and the attr
method is typically more than twice as fast.
A fourth method is the extreme champion, which is a simple nested lapply
, returning a nested list. This method exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame
function were roughly an order of magnitude slower than the data.table
techniques.
data
dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))
A couple of more options :
With asplit
asplit(xy.df, 1)
#[[1]]
# x y
#0.1137 0.6936
#[[2]]
# x y
#0.6223 0.5450
#[[3]]
# x y
#0.6093 0.2827
#....
With split
and row
split(xy.df, row(xy.df)[, 1])
#$`1`
# x y
#1 0.1137 0.6936
#$`2`
# x y
#2 0.6223 0.545
#$`3`
# x y
#3 0.6093 0.2827
#....
data
set.seed(1234)
xy.df <- data.frame(x = runif(10), y = runif(10))
Seems a current version of the purrr
(0.2.2) package is the fastest solution:
by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
Let's compare the most interesting solutions:
data("Batting", package = "Lahman")
x <- Batting[1:10000, 1:10]
library(benchr)
library(purrr)
benchmark(
split = split(x, seq_len(.row_names_info(x, 2L))),
mapply = .mapply(function(...) structure(list(...), class = "data.frame", row.names = 1L), x, NULL),
purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
)
Rsults:
Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
split 100 983.0 1060.0 1130.0 1130.0 1180.0 1450 113000 34.3
mapply 100 826.0 894.0 963.0 972.0 1030.0 1320 97200 29.3
purrr 100 24.1 28.6 32.9 44.9 40.5 183 4490 1.0
Also we can get the same result with Rcpp
:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List df2list(const DataFrame& x) {
std::size_t nrows = x.rows();
std::size_t ncols = x.cols();
CharacterVector nms = x.names();
List res(no_init(nrows));
for (std::size_t i = 0; i < nrows; ++i) {
List tmp(no_init(ncols));
for (std::size_t j = 0; j < ncols; ++j) {
switch(TYPEOF(x[j])) {
case INTSXP: {
if (Rf_isFactor(x[j])) {
IntegerVector t = as<IntegerVector>(x[j]);
RObject t2 = wrap(t[i]);
t2.attr("class") = "factor";
t2.attr("levels") = t.attr("levels");
tmp[j] = t2;
} else {
tmp[j] = as<IntegerVector>(x[j])[i];
}
break;
}
case LGLSXP: {
tmp[j] = as<LogicalVector>(x[j])[i];
break;
}
case CPLXSXP: {
tmp[j] = as<ComplexVector>(x[j])[i];
break;
}
case REALSXP: {
tmp[j] = as<NumericVector>(x[j])[i];
break;
}
case STRSXP: {
tmp[j] = as<std::string>(as<CharacterVector>(x[j])[i]);
break;
}
default: stop("Unsupported type '%s'.", type2name(x));
}
}
tmp.attr("class") = "data.frame";
tmp.attr("row.names") = 1;
tmp.attr("names") = nms;
res[i] = tmp;
}
res.attr("names") = x.attr("row.names");
return res;
}
Now caompare with purrr
:
benchmark(
purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out,
rcpp = df2list(x)
)
Results:
Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
purrr 100 25.2 29.8 37.5 43.4 44.2 159.0 4340 1.1
rcpp 100 19.0 27.9 34.3 35.8 37.2 93.8 3580 1.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With