
How to profile large datasets with Pandas profiling?

My data is not perfectly clean, but pandas handles it without issue. The pandas library provides many extremely useful functions for EDA.

But when I run the profiling on large data, i.e. 100 million records with 10 columns read from a database table, it does not complete and my laptop runs out of memory. The data is around 6 GB as CSV. My laptop has 14 GB of RAM, of which roughly 3-4 GB is in use when idle.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")

I have also tried the check_recoded=False option, but it does not help enough to profile the whole dataset. Is there any way to read the data in chunks and still generate a single summary report for the whole dataset? Or is there any other way to use this function with a large dataset?

Viv asked May 08 '19 07:05



2 Answers

pandas-profiling v2.4 introduced a minimal mode that disables expensive computations (such as correlations and dynamic binning):

from pandas_profiling import ProfileReport


profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
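
If minimal mode alone still exhausts memory, one option is to profile a random sample rather than the full table. The following is a minimal sketch and not part of the answer above: it assumes that a sample is representative enough for EDA, the helper name `sample_in_chunks` and its parameters are hypothetical, and an in-memory SQLite table stands in for the real database. It reads the query result in chunks with pandas' `chunksize` option and keeps a fraction of each chunk, so the full result is never assembled as one DataFrame. Whether rows are actually streamed from the server depends on the database driver.

```python
import sqlite3

import pandas as pd


def sample_in_chunks(query, conn, chunksize=100_000, frac=0.01, seed=0):
    """Read a query result in chunks and keep a random fraction of each chunk.

    Hypothetical helper: only the sampled rows are retained, so peak memory
    is roughly one chunk plus the accumulated sample.
    """
    parts = []
    chunks = pd.read_sql_query(query, conn, chunksize=chunksize)
    for i, chunk in enumerate(chunks):
        # Vary the seed per chunk so the same row positions are not
        # picked in every chunk.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


if __name__ == "__main__":
    # Demo table in an in-memory SQLite database (stand-in for the real one).
    conn = sqlite3.connect(":memory:")
    demo = pd.DataFrame({"id": range(10_000), "value": range(10_000)})
    demo.to_sql("t", conn, index=False)

    sample = sample_in_chunks("select * from t", conn, chunksize=1_000, frac=0.1)
    print(sample.shape)

    # The sample can then be profiled as in the answer above:
    # from pandas_profiling import ProfileReport
    # ProfileReport(sample, minimal=True).to_file(output_file="output.html")
```

The resulting report describes the sample, not the full 100 million rows, so counts and rare-value statistics should be read as estimates.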
Giorgos Myrianthous answered Nov 08 '22 05:11



The syntax for disabling the calculation of correlations (which heavily reduces the amount of computation) changed considerably between pandas-profiling 1.4 and the current (beta) version, pandas-profiling 2.0. It is now:

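import pandas_profiling  # importing registers df.profile_report()

# Switch off every correlation measure
profile = df.profile_report(correlations={
    "pearson": False,
    "spearman": False,
    "kendall": False,
    "phi_k": False,
    "cramers": False,
    "recoded": False,
})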

Additionally, you can reduce the work further by disabling the calculation of bins used for plotting histograms.

profile = df.profile_report(plot={'histogram': {'bins': None}})
cptnJ answered Nov 08 '22 04:11
