Is there an idiomatic equivalent to SQL's window functions in Pandas? For example, what's the most compact way to write the equivalent of this in Pandas?:
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
Or this?:
SELECT state_name,
state_population,
region,
SUM(state_population)
OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name
pandas contains a compact set of APIs for performing windowing operations - an operation that performs an aggregation over a sliding partition of values.
There are, of course, alternatives for both but they are the predominant ones in the field. Since both Pandas and SQL operate on tabular data, similar operations or queries can be done using both.
SQL is more efficient in querying data but it has less functions whereas in pandas, there might be lag for large volumes of data but it has more functions which enable us to manipulate data in an effective way.
Pandas supports row AND column metadata; SQL only has column metadata. While Pandas supports column metadata (i.e., column labels) like databases, Pandas also supports row-wise metadata in the form of row labels. This is convenient if we want to organize and refer to data in an intuitive manner.
For the first SQL:
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
Pandas:
df.assign(national_population=df.state_population.sum()).sort_values('state_name')
For the second SQL:
SELECT state_name,
state_population,
region,
SUM(state_population)
OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name
Pandas:
df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
.sort_values('state_name')
DEMO:
In [238]: df
Out[238]:
region state_name state_population
0 1 aaa 100
1 1 bbb 110
2 2 ccc 200
3 2 ddd 100
4 2 eee 100
5 3 xxx 55
national_population:
In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
region state_name state_population national_population
0 1 aaa 100 665
1 1 bbb 110 665
2 2 ccc 200 665
3 2 ddd 100 665
4 2 eee 100 665
5 3 xxx 55 665
regional_population:
In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
...: .sort_values('state_name')
Out[239]:
region state_name state_population regional_population
0 1 aaa 100 210
1 1 bbb 110 210
2 2 ccc 200 400
3 2 ddd 100 400
4 2 eee 100 400
5 3 xxx 55 55
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With