My data is not perfectly clean, but pandas handles it without issue. The pandas library provides many extremely useful functions for EDA.
However, when I profile a large dataset, i.e. 100 million records with 10 columns read from a database table, the report never completes and my laptop runs out of memory. The data is around 6 GB as CSV; my machine has 14 GB of RAM, of which roughly 3-4 GB is already in use when idle.
import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")
I have also tried the check_recoded = False option, but it does not solve the problem.
Is there a way to read the data in chunks and still generate a single summary report for the whole dataset? Or is there any other way to use this function with a large dataset?
Version 2.4 of pandas-profiling introduced minimal mode, which disables expensive computations (such as correlations and dynamic binning):
from pandas_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
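Independently of the profiling options, you can also shrink the DataFrame itself before profiling, for example by selecting only the columns you need in the query and downcasting numeric columns. A minimal sketch (the column names are placeholders, and it assumes the values fit into smaller numeric types):

import pandas as pd

# Hypothetical: pull only the columns you actually want to profile
df = pd.read_sql_query("select col_a, col_b, col_c from table", conn_params)

# Downcast 64-bit numeric columns to the smallest type that holds their values
for col in df.select_dtypes(include="integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include="float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")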
The syntax for disabling the calculation of correlations (which heavily reduces the work performed) changed considerably between pandas-profiling 1.4 and the current (beta) version, pandas-profiling 2.0, where it looks as follows:
profile = df.profile_report(correlations={
    "pearson": False,
    "spearman": False,
    "kendall": False,
    "phi_k": False,
    "cramers": False,
    "recoded": False,
})
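As with the minimal-mode example above, the resulting report is then written to disk with to_file:

profile.to_file(output_file="output.html")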
Additionally, you can reduce the work performed by disabling the calculation of bins for histogram plotting:
profile = df.profile_report(plot={'histogram': {'bins': None}})
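As for reading the data in chunks: pandas-profiling needs the whole DataFrame in memory, but pd.read_sql_query accepts a chunksize argument, so one option is to stream the table and profile a random sample rather than all 100 million rows. A sketch, assuming a 10% sample is representative enough (the chunk size and sampling fraction are illustrative, not prescriptive):

import pandas as pd
from pandas_profiling import ProfileReport

# Stream the query result in chunks of 1,000,000 rows instead of loading it all at once
chunks = pd.read_sql_query("select * from table", conn_params, chunksize=1_000_000)

# Keep a 10% random sample of each chunk; only the accumulated sample is held in memory
sample = pd.concat(chunk.sample(frac=0.1, random_state=42) for chunk in chunks)

profile = ProfileReport(sample, minimal=True)
profile.to_file(output_file="sample_output.html")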