
Split a large dataframe into a list of data frames based on common value in column

I have a data frame with 10 columns that records the actions of "users". One of the columns (column 10) contains an ID that identifies the user and is not unique. The data frame has about 750000 rows. I am trying to extract individual data frames (that is, a list or vector of data frames) split by the column containing the "user" identifier, in order to isolate the actions of a single actor.

ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
4  | aad   | bb4   | ... | u_002

resulting in

list(
ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
,
4  | aad   | bb4   | ... | u_002
...)

The following works very well for me on a small sample (1000 rows):

paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x) 

and then accessing the element I want by paths[1] for instance.

When applied to the original large data frame, or even to a matrix representation, this chokes my machine (4GB RAM, MacOSX 10.6, R 2.15) and never completes (I know that a newer R version exists, but I believe this is not the main problem).

It seems that split performs better and does complete after a long time, but I do not know (my R knowledge is limited) how to piece the resulting list of vectors into a vector of matrices.

path = split(smallsampleMat, smallsampleMat[,10])  

I have also considered using big.matrix etc., but without much success in speeding up the process.

MartinT asked Aug 30 '13 07:08





2 Answers

You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each element, because a matrix is itself an atomic vector with dimension attributes. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of different types and sizes, so it's very versatile, and you can use *apply functions to further operate on each element in the list. Example below.

#  For reproducible data
set.seed(1)

#  Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )

#  Split on userid
out <- split( df , f = df$userid )
#$`1`
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5

#$`2`
#  userid data1 data2
#2      2   xfv     4
#4      2   bfe    10
#6      2   mrx     2
#8      2   fqd     9

Access each element using the [[ operator like this:

out[[1]]
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5
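The question used `paths[1]`, so it may help to see how single and double brackets differ on a list. The following is only a minimal sketch on made-up toy data (the column names `userid` and `value` are assumptions, not the asker's real columns):

```r
# Hypothetical toy data, not the data from the question
df <- data.frame(userid = c("u_001", "u_001", "u_002"),
                 value  = c(1, 2, 3))
out <- split(df, df$userid)

sub_list <- out[1]          # single bracket: still a list, of length 1
first_df <- out[[1]]        # double bracket: the data frame itself
by_name  <- out[["u_002"]]  # elements can also be looked up by group name

is.data.frame(sub_list)     # FALSE
is.data.frame(first_df)     # TRUE
```

Looking elements up by name is often more robust than by position, since the position of a user in the list depends on how the IDs sort.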

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

sapply( out , function(x) mean( x$data2 ) )
#   1    2 
#3.75 6.25 
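If what is wanted in the end is one matrix per user, as the question suggests, the list returned by split can be converted element by element. This is a sketch under assumptions: the toy columns `x` and `y` are hypothetical and all numeric, so that `as.matrix` yields numeric matrices.

```r
# Hypothetical toy data; the column names are assumptions
df <- data.frame(userid = c("u_001", "u_002", "u_001", "u_002"),
                 x = c(1, 2, 3, 4),
                 y = c(10, 20, 30, 40))

# Split only the numeric columns, keyed by user
parts <- split(df[c("x", "y")], df$userid)

# Turn every per-user data frame into a numeric matrix
mats <- lapply(parts, as.matrix)

# vapply works like sapply but checks the type and length of each result
row_counts <- vapply(parts, nrow, integer(1))
```

The result `mats` is still a list, which is the natural container here: the per-user matrices can have different numbers of rows.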
Simon O'Hanlon answered Oct 12 '22 22:10



From version 0.8.0, dplyr offers a handy function called group_split():

# On sample data from @Aus_10

df %>%
  group_split(g)

[[1]]
# A tibble: 25 x 3
   ran_data1 ran_data2 g
       <dbl>     <dbl> <fct>
 1     2.04      0.627 A
 2     0.530    -0.703 A
 3    -0.475     0.541 A
 4     1.20     -0.565 A
 5    -0.380    -0.126 A
 6     1.25     -1.69  A
 7    -0.153    -1.02  A
 8     1.52     -0.520 A
 9     0.905    -0.976 A
10     0.517    -0.535 A
# … with 15 more rows

[[2]]
# A tibble: 25 x 3
   ran_data1 ran_data2 g
       <dbl>     <dbl> <fct>
 1     1.61      0.858 B
 2     1.05     -1.25  B
 3    -0.440    -0.506 B
 4    -1.17      1.81  B
 5     1.47     -1.60  B
 6    -0.682    -0.726 B
 7    -2.21      0.282 B
 8    -0.499     0.591 B
 9     0.711    -1.21  B
10     0.705     0.960 B
# … with 15 more rows

To not include the grouping column (in dplyr 1.0.0 and later this argument is spelled .keep):

df %>%
  group_split(g, keep = FALSE)
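For comparison, dropping the grouping column can also be done without dplyr. This is a base R sketch, not part of the answer's method, and the toy data frame with columns `g` and `value` is hypothetical:

```r
# Hypothetical toy data; `g` plays the role of the grouping column
df <- data.frame(g = c("A", "B", "A"), value = c(1, 2, 3))

# Split by g, then drop the g column from every piece
pieces <- lapply(split(df, df$g), function(d) d[setdiff(names(d), "g")])

names(pieces)     # the group labels
names(pieces$A)   # the remaining columns
```

One difference worth knowing: split names the list elements after the groups, whereas group_split returns an unnamed list, as the `[[1]]` and `[[2]]` in the output above show.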
tmfmnk answered Oct 13 '22 00:10
