Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas equivalent to SQL window functions

Is there an idiomatic equivalent to SQL's window functions in Pandas? For example, what's the most compact way to write the equivalent of this in Pandas?:

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Or this?:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name
like image 790
2daaa Avatar asked Jan 10 '17 16:01

2daaa


People also ask

What is window function in pandas?

pandas contains a compact set of APIs for performing windowing operations - an operation that performs an aggregation over a sliding partition of values.

Can I use pandas instead of SQL?

There are, of course, alternatives for both but they are the predominant ones in the field. Since both Pandas and SQL operate on tabular data, similar operations or queries can be done using both.

Is pandas more efficient than SQL?

SQL is more efficient in querying data but it has less functions whereas in pandas, there might be lag for large volumes of data but it has more functions which enable us to manipulate data in an effective way.

Is pandas same as SQL?

Pandas supports row AND column metadata; SQL only has column metadata. While Pandas supports column metadata (i.e., column labels) like databases, Pandas also supports row-wise metadata in the form of row labels. This is convenient if we want to organize and refer to data in an intuitive manner.


1 Answers

For the first SQL:

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Pandas:

df.assign(national_population=df.state_population.sum()).sort_values('state_name')

For the second SQL:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Pandas:

df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
  .sort_values('state_name')

DEMO:

In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55

national_population:

In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665

regional_population:

In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
     ...:   .sort_values('state_name')
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55
like image 116
MaxU - stop WAR against UA Avatar answered Sep 28 '22 11:09

MaxU - stop WAR against UA