Does pandas.DataFrame.groupby create a copy of the data or just a view? In the (more probable) case of not creating a copy, what is the additional memory overhead, and how does it scale with the characteristics of the original dataframe (e.g. number of rows, columns, distinct groups)?
Transformations. A transformation on a group or a column returns an object that is indexed the same as, and has the same size as, the object being grouped.
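As a minimal sketch of that statement (the dataframe, its column names and its values are made up for illustration), transform hands back a result aligned row for row with the input:

```python
import pandas as pd

# Hypothetical example data: column names and values are assumptions.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]})

# transform computes one value per group and broadcasts it back to every
# row of that group, so the result lines up with the original rows.
group_mean = df.groupby("key")["value"].transform("mean")

same_index = group_mean.index.equals(df.index)
same_length = len(group_mean) == len(df)
```

Because the index matches, the result can be assigned straight back as a new column, e.g. df["group_mean"] = group_mean.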
The function passed to apply must take a dataframe as its first argument and return a dataframe, a series or a scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
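A small sketch of the two common cases, a function returning a scalar and a function returning a dataframe. The data and the helper name value_range are hypothetical, and the column is selected explicitly so the function only ever sees the value column regardless of pandas version:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]})

def value_range(group: pd.DataFrame):
    """Illustrative helper: spread of 'value' inside one group (a scalar)."""
    return group["value"].max() - group["value"].min()

# One scalar per group: apply combines them into a Series indexed by key.
ranges = df.groupby("key")[["value"]].apply(value_range)

# One small dataframe per group: apply stacks them into a single dataframe.
top_rows = df.groupby("key")[["value"]].apply(lambda g: g.nlargest(1, "value"))
```

For simple reductions like these, a built-in aggregation (agg, max, min) is usually faster than apply; apply is the fallback for logic that does not fit those.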
Groupby preserves the order of rows within each group. With group_keys, apply adds the group keys to the index to identify the pieces. The older squeeze option reduced the dimensionality of the return type if possible, and otherwise returned a consistent type.
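A quick way to check the ordering and group_keys behaviour yourself. This is only a sketch: the data is invented, and head(1) is used purely as a stand-in for a function that returns a subset of each group:

```python
import pandas as pd

# Hypothetical data where the keys are deliberately interleaved.
df = pd.DataFrame({"key": ["b", "a", "b", "a"], "value": [10, 20, 30, 40]})

# Within each group, rows keep their original relative order.
rows_per_group = {k: g["value"].tolist() for k, g in df.groupby("key")}

# sort=False keeps the groups themselves in order of first appearance.
first_seen_order = [k for k, _ in df.groupby("key", sort=False)]

# group_keys controls whether apply prepends the group key to the index.
with_keys = df.groupby("key", group_keys=True)[["value"]].apply(lambda g: g.head(1))
without_keys = df.groupby("key", group_keys=False)[["value"]].apply(lambda g: g.head(1))
```

With group_keys=True the result carries a two-level index (key, original row label); with group_keys=False only the original row labels remain.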
In SQL, by contrast, GROUP BY does not necessarily order the data. A database is designed to retrieve the data as fast as possible and sorts only if necessary, so add an ORDER BY if you need a guaranteed order.
I did a little more research on this since someone asked me to help them with this question, and the pandas source code has been revised somewhat since the accepted answer was written.
According to what I can tell from the source code:
Groupby exposes the groups through a Grouper object (i.e. Grouper.groups), and a Grouper is “a specification for a groupby instruction”.
Ok, so what does that mean?
“Groupers are ultimately index mappings.”
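You can see the "index mapping" idea from the public API without reading the source. The sketch below uses made-up data and the groups and indices attributes of the groupby object. It only shows what the mapping looks like, not how pandas stores it internally:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]})

gb = df.groupby("key")

# groups: group key -> index labels of the rows in that group.
label_map = {k: list(v) for k, v in gb.groups.items()}

# indices: group key -> integer positions of the rows in that group.
position_map = {k: v.tolist() for k, v in gb.indices.items()}

# Every row appears in exactly one group, so the mapping holds one
# entry per row, however many columns the dataframe has.
total_mapped_rows = sum(len(v) for v in position_map.values())
```

This suggests (as an assumption, not a measurement) that the bookkeeping grows with the number of rows and distinct groups rather than with the number of columns.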
I've always thought of this as meaning that groupby is creating a new object. It's not a full copy of the original dataframe, because you're performing selections and aggregations. So it's more like a transformation in that sense.
If your definition of a view is like this: "A view is nothing more than a SQL statement that is stored in the database with an associated name. A view is actually a composition of a table in the form of a predefined SQL query", then I'm wondering if what you're really asking is whether the groupby operation has to be re-applied each time you execute the same grouping on the same dataframe?
If that's what you're asking, I'd say the answer is no, it's not like a view, as long as you store the result of the grouping operation. The output object of a grouped dataframe or series is a (new) dataframe or series.
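One way to convince yourself that the aggregated output is a new object is to check whether it shares memory with the source and whether changing it touches the source. A rough sketch with invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]})

# Aggregating produces a new Series indexed by the group keys.
result = df.groupby("key")["value"].sum()

# The result's buffer is separate from the original column's buffer.
shares = np.shares_memory(result.to_numpy(), df["value"].to_numpy())

# Changing the stored result leaves the original dataframe untouched.
result.loc["a"] = 100
original_values = df["value"].tolist()
```

So as long as you keep result around, nothing is recomputed when you use it again; calling df.groupby(...).sum() a second time does redo the work.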