
What do compute, collect, and collapse do in dplyr?

Tags: dataframe, r, dplyr

I'm learning the dplyr package in R, but I've hit a wall trying to understand what three of its functions (compute, collect, and collapse) do.

I understand that dplyr doesn't use the data.frame type internally; instead, it stores data in its own types, tbl or tbl_df.

To convert that custom type back to R's default data.frame, so that the standard data.frame functions can be used on it, you call collect, for example:

library(dplyr)
# lahman_sqlite() was part of dplyr in 2016; it has since moved to dbplyr.
batting <- tbl(lahman_sqlite(), "Batting")
dim(collect(batting))

This returns [1] 99846 22 as of 2016, while dim(batting) returns [1] NA 22.

However, I'm not sure what the other two functions, compute and collapse, do. The help page shown by ?collect says the following:

Description:

‘compute’ forces computation of lazy tbls, leaving data in the remote source. ‘collect’ also forces computation, but will bring data back into an R data.frame (stored in a ‘tbl_df’). ‘collapse’ doesn't force computation, but collapses a complex tbl into a form that additional restrictions can be placed on.

What does this mean, specifically "forces computation of lazy tbls"?


UPDATE

I would like to know what each of these functions does, and in particular what one does that the others don't.

Blaszard asked Dec 31 '16 04:12



1 Answer

From one of the dplyr vignettes:

There are three ways to force the computation of a query:

  • collect() executes the query and returns the results to R.

  • compute() executes the query and stores the results in a temporary table in the database.

  • collapse() turns the query into a table expression.

collect() is the function you’ll use most. Once you reach the set of operations you want, you use collect() to pull the data into a local tbl_df(). If you know SQL, you can use compute() and collapse() to optimise performance.
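To make the quoted distinction concrete, here is a minimal sketch. It assumes the DBI, RSQLite, dplyr and dbplyr packages are installed; the in-memory SQLite database and the mtcars table are stand-ins chosen for illustration and are not part of the original question.

```r
# Assumes DBI, RSQLite, dplyr and dbplyr are installed.
library(dplyr)

# An in-memory SQLite database stands in for a remote source (illustrative).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# Build a lazy query: no SQL has been run for it yet.
q <- tbl(con, "mtcars") %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt)

show_query(q)  # prints the SQL that would be sent to the database
nrow(q)        # NA, because the row count is unknown until the query runs

# collect(): run the query and pull the rows into R as a local tibble.
local_df <- collect(q)
nrow(local_df)

# compute(): run the query, but keep the result in the database
# as a temporary table; only a reference comes back to R.
remote_tbl <- compute(q)
DBI::dbListTables(con)

# collapse(): run nothing; wrap the query built so far as a subquery
# so that further verbs are applied on top of it.
sub <- collapse(q)
show_query(sub)

DBI::dbDisconnect(con)
```

Comparing the output of show_query(q) and show_query(sub) is a quick way to see what collapse() changes, and the table list after compute() should show the extra temporary table it created.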

If that's not helpful, your best bet is probably studying the source code of each function. You can see instructions on how to do that here: How do I see the help for the `dplyr::collect` method?

Ben answered Oct 24 '22 03:10