Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge pandas groupBy objects

Tags:

I have a huge dataset of 292 million rows (6GB) in CSV format. Panda's read_csv function is not working for such big file. So I am reading data in small chunks (10 million rows) iteratively using this code :

for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
       #something ...

In the #something I am grouping rows according to some columns. So in each iteration, I get new groupBy objects. I am not able to merge these groupBy objects.

A smaller dummy example is as follows :

Here dummy.csv is a 28 rows CSV file, which is trade report between some countries in some year. sitc is some product code and export is export amount in some USD billion. (Please note that data is fictional)

year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,ind,chn,2146,3
2001,ind,chn,4132,10
2002,ind,chn,2227,7
2002,ind,chn,4132,7
2000,ind,aus,7777,19
2001,ind,aus,2146,30
2001,ind,aus,4132,12
2002,ind,aus,4133,30
2000,aus,ind,4132,6
2001,aus,ind,2146,8
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2002,chn,aus,8987,7
2001,chn,aus,4879,3
2002,aus,chn,3489,7
2002,chn,aus,2092,30
2002,chn,aus,4133,13
2002,aus,ind,0193,6
2002,aus,ind,0289,8
2003,chn,aus,0839,9
2003,chn,aus,9867,31
2003,aus,chn,3442,3
2004,aus,chn,3344,17
2005,aus,chn,3489,11
2001,aus,ind,0893,17

I split it into two 14 rows data and grouped them according to year, origin, dest.

 for chunk in pd.read_csv('dummy.csv', chunksize=14):
       xd = chunk.groupby(['origin','dest','year'])['export'].sum();
       print(xd)

Results :

origin  dest  year
aus     ind   2000     6
              2001     8
chn     aus   2001    40
ind     aus   2000    19
              2001    42
              2002    30
        chn   2000     9
              2001    13
              2002    14
Name: export, dtype: int64
origin  dest  year
aus     chn   2002     7
              2003     3
              2004    17
              2005    11
        ind   2001    17
              2002    14
chn     aus   2001    15
              2002    50
              2003    40
Name: export, dtype: int64

How can I merge the two GroupBy objects?

Will merging them, again create memory issues in the big data? A prediction by looking at the nature of data, if properly merged the number of rows will surely reduce by at least 10-15 times.

The basic aim is :

Given origin country and dest country, I need to plot total exports between them yearwise. Querying this everytime over the whole data is taking a lot of time.

xd = chunk.loc[(chunk.origin == country1) & (chunk.dest == country2)]

Hence I was thinking to save time by once arranging them in groupBy manner.

Any suggestion is greatly appreciated.

like image 597
Pushpendu Ghosh Avatar asked Sep 20 '18 12:09

Pushpendu Ghosh


People also ask

How do I merge Groupby in pandas?

Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.

Can you merge objects pandas?

Pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join operation. The joining is performed on columns or indexes. If the joining is done on columns, indexes are ignored. This function returns a new DataFrame and the source DataFrame objects are unchanged.

What can you do with a Groupby object?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.

What is the difference between merge join and concatenate?

merge() for combining data on common columns or indices. . join() for combining data on a key column or an index. concat() for combining DataFrames across rows or columns.


1 Answers

You can use pd.concat to join groupby results and then apply sum:

>>> pd.concat([xd0,xd1],axis=1)
                  export  export
origin dest year                
aus    ind  2000       6       6
            2001       8       8
chn    aus  2001      40      40
ind    aus  2000      19      19
            2001      42      42
            2002      30      30
       chn  2000       9       9
            2001      13      13
            2002      14      14

>>> pd.concat([xd0,xd1],axis=1).sum(axis=1)
origin  dest  year
aus     ind   2000    12
              2001    16
chn     aus   2001    80
ind     aus   2000    38
              2001    84
              2002    60
        chn   2000    18
              2001    26
              2002    28
like image 139
hellpanderr Avatar answered Sep 28 '22 05:09

hellpanderr