
Group by several columns, then aggregate a set of columns, in Pandas (it crashes badly compared to R's data.table)

I am relatively new to the world of Python and am trying to use it as a back-up platform for data analysis. I generally use data.table for my data analysis needs.

The issue is that when I run a group-and-aggregate operation on a big CSV file (randomized, zipped, uploaded at http://www.filedropper.com/ddataredact_1), Python throws:

return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed

or (I have even encountered the following):

File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product for i, x in enumerate(X)] File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in for i, x in enumerate(X)] File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat return _wrapfunc(a, 'repeat', repeats, axis=axis) File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc return getattr(obj, method)(*args, **kwds) MemoryError

I have spent three days trying to reduce the file size (I was able to cut it by 89%), adding breakpoints, and debugging, but I was not able to make any progress.

Surprisingly, when I ran the same group/aggregate operation with data.table in R, it hardly took 1 second. Moreover, I didn't have to do any of the data-type conversions suggested at https://www.dataquest.io/blog/pandas-big-data/.

I also researched other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame; Pandas: df.groupby() is too slow for big data set. Any alternatives methods?; and pandas groupby with sum() on large csv file?. It seems those threads are more about matrix multiplication. I'd appreciate it if you wouldn't tag this as a duplicate.

Here's my Python code:

import pandas as pd

finaldatapath = "..\\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    low_memory=False, encoding="ISO-8859-1")

#before optimization: 353MB
ddata.info(memory_usage="deep")

# optimize memory: object-dtype columns are the biggest culprit,
# so convert each of them to the category dtype.
# (Downcasting floats didn't help much, so I am excluding it here.)
ddata_obj = ddata.select_dtypes(include=['object']).copy()

for col in ddata_obj:
    del ddata[col]
    ddata[col] = ddata_obj[col].astype('category')

#release memory
del ddata_obj

#after optimization: 39MB
ddata.info(memory_usage="deep")
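
# (Editor's aside, not from the original post: pandas can also build category
# columns directly at read time via read_csv's dtype argument, which avoids
# the intermediate object-dtype copy above. A hedged sketch, using an
# illustrative subset of the column names from this script:)
#
# cat_cols = ["Business", "Device_Family", "Geo", "Segment", "Seller"]
# ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
#                     encoding="ISO-8859-1",
#                     dtype={c: "category" for c in cat_cols})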


#Create a list of grouping variables:
group_column_list = [
    "Business",
    "Device_Family",
    "Geo",
    "Segment",
    "Cust_Name",
    "GID",
    "Device ID",
    "Seller",
    "C9Phone_Margins_Flag",
    "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type",
    "Type",
    "Term",
    "Cust_ANDroid_Margin_Bucket",
    "Cust_Mobile_Margin_Bucket",
    # "Cust_Android_App_Bucket",
    "ANDroind_App_Cust_Y_N",
]

print("Analyzing data now...")

def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean(),
    }
    return pd.Series(names)

ddata = ddata.reset_index(drop=True)

ddata = ddata.groupby(group_column_list).apply(ddata_agg)

The code crashes in the .groupby/.apply operation above.
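
For what it's worth, the cartesian_product frames in the MemoryError traceback suggest a likely mechanism (my reading, not something confirmed in the post): when the grouping keys are category dtype, older pandas versions default to observed=False and materialize every combination of category levels, whether or not it occurs in the data. With 16 grouping columns, that product can dwarf the actual number of rows. A minimal sketch on toy data:

import numpy as np
import pandas as pd

# Toy frame: two categorical key columns with 3 and 1000 levels.
df = pd.DataFrame({
    'a': pd.Categorical(np.random.choice(list('abc'), size=100)),
    'b': pd.Categorical(np.random.choice(np.arange(1000), size=100)),
    'v': np.random.rand(100),
})

# observed=False (the old default) enumerates all 3 * 1000 = 3000 level
# combinations, even though at most 100 of them occur in the data:
print(len(df.groupby(['a', 'b']).size()))                 # 3000
print(len(df.groupby(['a', 'b'], observed=True).size()))  # <= 100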

Can someone please help me? Compared to my other posts, I have probably spent the most time on this StackOverflow post, trying to fix it and learn new things about Python. However, I have reached saturation, and it frustrates me even more because R's data.table package processes this file in <2 seconds. This post is not about the pros and cons of R and Python, but about using Python more productively.

I am completely lost, and I'd appreciate any help.


Here's my data.table R code:

path_r <- "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE, data.table = TRUE, header = TRUE)

group_column_list <- c(
  "Business",
  "Device_Family",
  "Geo",
  "Segment",
  "Cust_Name",
  "GID",
  "Device ID",
  "Seller",
  "C9Phone_Margins_Flag",
  "C9Phone_Cust_Y_N",
  "ANDroid_Lic_Type",
  "Type",
  "Term",
  "Cust_ANDroid_Margin_Bucket",
  "Cust_Mobile_Margin_Bucket",
  # "Cust_Android_App_Bucket",
  "ANDroind_App_Cust_Y_N"
)

ddata <- ddata[, .(
  ANDroid_Margin       = sum(ANDroid_Margin, na.rm = TRUE),
  Margins              = sum(Margins, na.rm = TRUE),
  Apple_Margin         = sum(Apple_Margin, na.rm = TRUE),
  Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
  Cust_Mobile_Margins  = mean(Cust_Mobile_Margins, na.rm = TRUE),
  Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
  ANDroid_App_Qty      = sum(ANDroid_App_Qty, na.rm = TRUE)
), by = group_column_list]

I have a 4-core, 16 GB RAM, Win10 x64 machine. I can provide any details the experts need.

Adding to josemz's comment, here are two threads on agg vs. apply: What is the difference between pandas agg and apply function? and Pandas difference between apply() and aggregate() functions.
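
To make the agg/apply distinction concrete, here is a tiny illustration on toy data (mine, not from either thread):

import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})

# agg maps each column to a named reduction and can use pandas' fast
# built-in implementations:
print(df.groupby('g').agg({'v': 'sum'}))

# apply calls an arbitrary Python function once per group, handing it the
# whole sub-DataFrame; far more flexible, but much slower on many groups:
print(df.groupby('g').apply(lambda x: x['v'].sum()))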

asked Oct 17 '18 by watchtower




1 Answer

I think what you're looking for is agg instead of apply. You can pass a dict mapping columns to the functions you want to apply, so I think this would work for you:

ddata = ddata.groupby(group_column_list).agg({
    'ANDroid_Margin'      : 'sum',
    'Margins'             : 'sum',
    'ANDroid_App_Qty'     : 'sum',
    'Apple_Margin'        : 'sum',
    'P_Lic'               : 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins' : 'mean',
    'Cust_ANDroid_App_Qty': 'mean'})
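
A follow-up note from me (an inference from the question's traceback, not part of josemz's answer): because the grouping columns were converted to category dtype, it may also be necessary to pass observed=True so that pandas only materializes group-key combinations that actually occur, rather than the cartesian product of all category levels. Conveniently, 'sum' and 'mean' in agg skip NaN by default, matching the na.rm = TRUE calls in the R code. A hedged sketch:

agg_map = {
    'ANDroid_Margin'      : 'sum',
    'Margins'             : 'sum',
    'ANDroid_App_Qty'     : 'sum',
    'Apple_Margin'        : 'sum',
    'P_Lic'               : 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins' : 'mean',
    'Cust_ANDroid_App_Qty': 'mean',
}

# observed=True keeps only combinations that appear in the data:
ddata = ddata.groupby(group_column_list, observed=True).agg(agg_map)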
answered Oct 08 '22 by josemz