I am relatively new to Python and am trying to use it as a backup platform for data analysis. I generally use data.table for my data analysis needs.
The issue is that when I run a group-and-aggregate operation on a big CSV file (randomized, zipped, and uploaded at http://www.filedropper.com/ddataredact_1), Python throws:
return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed
or (I have even encountered the following):
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product for i, x in enumerate(X)] File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in for i, x in enumerate(X)] File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat return _wrapfunc(a, 'repeat', repeats, axis=axis) File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc return getattr(obj, method)(*args, **kwds) MemoryError
I have spent three days trying to reduce the file size (I was able to reduce it by 89%), adding breakpoints, and debugging, but I was not able to make any progress.
Surprisingly, I thought of running the same group/aggregate operation in data.table in R, and it took barely a second. Moreover, I didn't have to do any of the data-type conversions suggested at https://www.dataquest.io/blog/pandas-big-data/.
I also researched other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame, Pandas: df.groupby() is too slow for big data set. Any alternatives methods?, and pandas groupby with sum() on large csv file?. It seems these threads are more about matrix multiplication. I'd appreciate it if you wouldn't tag this as a duplicate.
Here's my Python code:
import pandas as pd

finaldatapath = r"..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    low_memory=False, encoding="ISO-8859-1")
#before optimization: 353MB
ddata.info(memory_usage="deep")
#optimize file: Object-types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()
#Now convert this to category type:
#Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]
    ddata.loc[:, col] = ddata_obj[col].astype('category')
#release memory
del ddata_obj
#after optimization: 39MB
ddata.info(memory_usage="deep")
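# Aside (a sketch, not part of the original script): if the object column names are
# known up front, read_csv can build the category columns directly via its dtype
# argument, which avoids holding the ddata_obj copy in memory at all. The column
# list below is a hypothetical illustration.
object_cols = ["Business", "Device_Family", "Geo"]  # assumed to be object columns
ddata_alt = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                        low_memory=False, encoding="ISO-8859-1",
                        dtype={c: "category" for c in object_cols})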
#Create a list of grouping variables:
group_column_list = [
    "Business",
    "Device_Family",
    "Geo",
    "Segment",
    "Cust_Name",
    "GID",
    "Device ID",
    "Seller",
    "C9Phone_Margins_Flag",
    "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type",
    "Type",
    "Term",
    'Cust_ANDroid_Margin_Bucket',
    'Cust_Mobile_Margin_Bucket',
    # 'Cust_Android_App_Bucket',
    'ANDroind_App_Cust_Y_N'
]
print("Analyzing data now...")
def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)
ddata = ddata.reset_index(drop=True)
ddata = ddata.groupby(group_column_list).apply(ddata_agg)
The code crashes in the above .groupby operation.
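As a rough sanity check (a sketch, not part of the original script) that can be run just before the .groupby call: the cartesian_product frames in the traceback suggest comparing the number of key combinations actually present in the data with the product of category levels across the grouping columns, since that product is what older pandas versions enumerate by default when the keys are category dtype.
import numpy as np

# Distinct key combinations actually present in the data
observed_groups = ddata.drop_duplicates(subset=group_column_list).shape[0]

# Combinations enumerated when every grouping column is a category dtype:
# the product of the number of category levels per column
possible_groups = np.prod(
    [float(ddata[c].astype('category').cat.categories.size) for c in group_column_list]
)

print("observed combinations:", observed_groups)
print("cartesian product of category levels:", possible_groups)
If the second number dwarfs the first, the intermediate grouping index itself would be a plausible explanation for the MemoryError.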
Can someone please help me? Compared to my other posts, I have probably spent the most time on this StackOverflow question, trying to fix it and learn new things about Python. However, I have reached saturation, and it frustrates me even more because R's data.table package processes this file in under 2 seconds. This post is not about the pros and cons of R and Python, but about using Python more productively.
I am completely lost, and I'd appreciate any help.
Here's my R data.table code:
path_r = "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE, data.table = TRUE, header = TRUE)
group_column_list <- c(
    "Business",
    "Device_Family",
    "Geo",
    "Segment",
    "Cust_Name",
    "GID",
    "Device ID",
    "Seller",
    "C9Phone_Margins_Flag",
    "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type",
    "Type",
    "Term",
    'Cust_ANDroid_Margin_Bucket',
    'Cust_Mobile_Margin_Bucket',
    # 'Cust_Android_App_Bucket',
    'ANDroind_App_Cust_Y_N'
)
ddata <- ddata[, .(ANDroid_Margin = sum(ANDroid_Margin, na.rm = TRUE),
                   Margins = sum(Margins, na.rm = TRUE),
                   Apple_Margin = sum(Apple_Margin, na.rm = TRUE),
                   Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
                   Cust_Mobile_Margins = mean(Cust_Mobile_Margins, na.rm = TRUE),
                   Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
                   ANDroid_App_Qty = sum(ANDroid_App_Qty, na.rm = TRUE)),
               by = group_column_list]
Adding to Josemz's comment, here are two threads on agg vs. apply: What is the difference between pandas agg and apply function? and Pandas difference between apply() and aggregate() functions.
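To illustrate the distinction those threads discuss, here is a minimal toy example (not the original data; the column names are made up for illustration): agg works column by column and can use pandas' fast built-in reductions, while apply hands the whole sub-DataFrame of each group to a plain Python function.
import pandas as pd

toy = pd.DataFrame({
    "key": ["a", "a", "b"],
    "x": [1, 2, 3],
    "y": [10.0, 20.0, 30.0],
})

# agg: column-wise, maps each column to a reduction
agg_result = toy.groupby("key").agg({"x": "sum", "y": "mean"})

# apply: the whole sub-DataFrame of each group goes through a Python function
apply_result = toy.groupby("key").apply(
    lambda g: pd.Series({"x": g["x"].sum(), "y": g["y"].mean()})
)

print(agg_result)
print(apply_result)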
I think what you're looking for is agg instead of apply. You can pass a dict mapping columns to the functions you want to apply, so I think this would work for you:
ddata = ddata.groupby(group_column_list).agg({
    'ANDroid_Margin'      : 'sum',
    'Margins'             : 'sum',
    'ANDroid_App_Qty'     : 'sum',
    'Apple_Margin'        : 'sum',
    'P_Lic'               : 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins' : 'mean',
    'Cust_ANDroid_App_Qty': 'mean'})
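As a follow-up usage sketch (an addition, not part of the original answer): chaining .reset_index() turns the grouping keys back into regular columns, much like the data.table output, and since the question converts the grouping columns to category dtype, passing observed=True to groupby (available in recent pandas versions) is worth trying. It limits the result to key combinations that actually occur rather than the full cartesian product of category levels that the MemoryError traceback points at.
result = (
    ddata.groupby(group_column_list, observed=True)  # observed=True: only combinations present in the data
         .agg({'ANDroid_Margin': 'sum', 'Margins': 'sum', 'ANDroid_App_Qty': 'sum',
               'Apple_Margin': 'sum', 'P_Lic': 'sum',
               'Cust_ANDroid_Margins': 'mean', 'Cust_Mobile_Margins': 'mean',
               'Cust_ANDroid_App_Qty': 'mean'})
         .reset_index()  # bring the grouping keys back as columns
)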