Pandas - SQL case statement equivalent

Tags: pandas

NOTE: I am looking for an efficient way to do this other than one large join followed by calculating the difference between dates.

I have table1 with a country ID and a date (no duplicates of these values). I want to summarize the information in table2, which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3. The goal is to append to table1 one column per cluster ID holding the summed count from table2, restricted to table2 rows whose date falls within the 30 days prior to the date in table1.

I believe this is simple in SQL. How can I do this in Pandas?

select a.date, a.country,
    sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
    sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
    sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
    on a.country = b.country
group by a.date, a.country
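For reference, the SQL above maps fairly directly onto pandas operations. The following is only a sketch under assumptions: the sample rows are hypothetical, and table2 is assumed to hold one numeric column per cluster (cluster_1, cluster_2, cluster_3), as the query implies. The join becomes `merge`, each `case when ... else 0` becomes `Series.where`, and `group by` with `sum` becomes `groupby(...).sum()`.

```python
import pandas as pd

# Hypothetical sample data in the wide layout the SQL query assumes
table1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-02-01", "2015-04-21"]),
    "country": [1, 1],
})
table2 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-30", "2015-04-15", "2014-01-30"]),
    "country": [1, 1, 1],
    "cluster_1": [0, 0, 1],
    "cluster_2": [2, 5, 0],
    "cluster_3": [0, 0, 0],
})
cluster_cols = ["cluster_1", "cluster_2", "cluster_3"]

# LEFT OUTER JOIN on country; the two "date" columns get suffixes _a and _b
joined = table1.merge(table2, on="country", how="left", suffixes=("_a", "_b"))

# a.date - b.date BETWEEN 1 AND 30 (inclusive on both ends, like SQL BETWEEN)
days = (joined["date_a"] - joined["date_b"]).dt.days
in_window = days.between(1, 30)

# CASE WHEN in_window THEN b.cluster_n ELSE 0 END, one column at a time
for col in cluster_cols:
    joined[col] = joined[col].where(in_window, 0)

# GROUP BY a.date, a.country with SUM over each cluster column
result = (
    joined.groupby(["date_a", "country"], as_index=False)[cluster_cols]
    .sum()
    .rename(columns={"date_a": "date"})
)
print(result)
```

This still materializes the full country-level join, so it does not address the efficiency concern in the note; it only shows the CASE-to-`where` translation.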

EDIT:

Here is a somewhat altered example. Say this is table1, an aggregated data set with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster 1, cluster 2, cluster 3 (there are actually 100 of them) for the matching country ID, as long as the date in table1 falls within the 30 days prior to the date in table2.

For example, the first row of the query dataset has date 2/2/2015 and country 1. In table1, only one row falls within the 30 days prior, and it is for cluster 2 with count 2.


Here is a dump of the two tables in CSV:

date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3

and table2:

date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
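To make the dump easy to experiment with, here is a small sketch that loads both tables from inline strings (no files needed) and checks the single-row example described above. The variable names are illustrative only. Note that the first query row in the dump is dated 2015-02-01.

```python
import io
import pandas as pd

table1_csv = """date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
"""
table2_csv = """date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
"""

table1 = pd.read_csv(io.StringIO(table1_csv), parse_dates=["date"])
table2 = pd.read_csv(io.StringIO(table2_csv), parse_dates=["date"])

# Take the first query row and find table1 rows for the same country
# dated between 1 and 30 days before it
query = table2.iloc[0]
age = query["date"] - table1["date"]
mask = (
    (table1["country"] == query["country"])
    & (age >= pd.Timedelta(days=1))
    & (age <= pd.Timedelta(days=30))
)
first_row_matches = table1[mask]
print(first_row_matches)
```

For this dump the only matching row is the cluster 2 row with count 2, which agrees with the example.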


asked Apr 19 '16 15:04

B_Miner

1 Answer

Edit: Oops, I wish I had seen that edit about joining before submitting. No problem, I'll leave this up since it was fun practice. Critiques welcome.

Assuming table1 and table2 are saved in the same directory as this script as "table1.csv" and "table2.csv", this should work.

I didn't get the same result as your examples with 30 days - had to bump it to 31 days, but I think the spirit is here:

import pandas as pd

table1_path = './table1.csv'
table2_path = './table2.csv'

# Load both tables and parse the date column as datetimes
table1 = pd.read_csv(table1_path, parse_dates=['date'])
table2 = pd.read_csv(table2_path, parse_dates=['date'])

# Pair every query row (table2) with every table1 row for the same country;
# the shared "date" column becomes date_x (table2) and date_y (table1)
joined = pd.merge(table2, table1, how='outer', on=['country'])

# How long before the query date each table1 row occurred
joined['datediff'] = joined.date_x - joined.date_y

# Keep only rows dated 1 to 31 days before the query date
in_window = (joined.datediff >= pd.Timedelta(days=1)) & (joined.datediff <= pd.Timedelta(days=31))
filtered = joined[in_window]

# Sum the counts per query date, country and cluster
summed = filtered.groupby(['date_x', 'country', 'cluster'])[['count']].sum()

# Pivot the clusters into columns and fill missing combinations with 0
result = summed.unstack().fillna(0).reset_index()

My test output:

ipdb> table1
                 date  country  cluster  count
0 2014-01-30 00:00:00        1        1      1
1 2015-02-03 00:00:00        1        1      3
2 2015-01-30 00:00:00        1        2      2
3 2015-04-15 00:00:00        1        2      5
4 2015-03-01 00:00:00        2        1      6
5 2015-07-01 00:00:00        2        2      4
6 2015-01-31 00:00:00        2        3      8
7 2015-01-21 00:00:00        2        1      2
8 2015-01-21 00:00:00        2        1      3
ipdb> table2
                 date  country
0 2015-02-01 00:00:00        1
1 2015-04-21 00:00:00        1
2 2015-02-21 00:00:00        2

...

ipdb> result
                     date_x  country  count
cluster                                   1  2  3
0       2015-02-01 00:00:00        1      0  2  0
1       2015-02-21 00:00:00        2      5  0  8
2       2015-04-21 00:00:00        1      0  5  0
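The following is a self-contained sketch of the same join-and-filter idea, not the exact code above. It inlines the CSV dump from the question, uses `pivot_table` in place of `groupby` plus `unstack`, and takes the window length as a parameter. The function name `window_counts` and the `cluster_N` output column names are made up for illustration. It also shows why 30 days gives a different result: the two cluster 1 rows for country 2 are dated 2015-01-21, which is exactly 31 days before the 2015-02-21 query date.

```python
import io
import pandas as pd

table1 = pd.read_csv(io.StringIO("""date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
"""), parse_dates=["date"])

table2 = pd.read_csv(io.StringIO("""date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
"""), parse_dates=["date"])


def window_counts(events, queries, max_days):
    """Sum `count` per cluster over events dated 1..max_days before each query row."""
    joined = queries.merge(events, on="country", suffixes=("_query", "_event"))
    days = (joined["date_query"] - joined["date_event"]).dt.days
    hits = joined[days.between(1, max_days)]

    # One row per query, one column per cluster, summed counts as values
    wide = hits.pivot_table(index=["date_query", "country"], columns="cluster",
                            values="count", aggfunc="sum", fill_value=0)
    # Make sure every cluster gets a column, even one with no hits in the window
    wide = wide.reindex(columns=sorted(events["cluster"].unique()), fill_value=0)
    wide.columns = [f"cluster_{c}" for c in wide.columns]
    wide = wide.reset_index().rename(columns={"date_query": "date"})

    # Left merge keeps query rows that had no matching events at all
    return queries.merge(wide, on=["date", "country"], how="left").fillna(0)


result_30 = window_counts(table1, table2, max_days=30)
result_31 = window_counts(table1, table2, max_days=31)
print(result_30)
print(result_31)
```

With `max_days=31` the cluster 1 total for the country 2 row is 5, matching the output above; with `max_days=30` it is 0. The final left merge is a design choice: it keeps a query row in the result even when nothing falls inside its window, which the filter-then-group approach would drop.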


answered Oct 09 '22 22:10

Robert Rodkey
