I am working with a large dataset (approximately 1 million rows) in Python using the Pandas library, and I am experiencing performance issues when performing operations such as filtering and aggregating data.
Here is a simplified version of my code:
import pandas as pd
# Load the dataset
df = pd.read_csv('large_dataset.csv')
# Example operation: Filtering and aggregating
result = df[df['column_name'] > threshold_value].groupby('another_column').mean()
I've tried using df.memory_usage(deep=True) to analyze memory usage and pd.read_csv() with the chunksize parameter to load the data in chunks, but I still face slow performance.
**What are some best practices for optimizing data processing with Pandas for large datasets?** Any suggestions on techniques, alternative libraries, or specific functions that could help improve performance would be greatly appreciated!
What I Tried:
Memory Analysis: I used df.memory_usage(deep=True) to understand memory consumption and found that certain columns were using a lot of memory due to their data types.
Loading Data in Chunks: I attempted to load the dataset in chunks using the chunksize parameter with pd.read_csv(). This allowed me to work with smaller parts of the dataset, but my filtering and aggregation operations remained slow.
Data Type Optimization: I experimented with changing data types of certain columns to more memory-efficient types (e.g., converting float64 to float32), which helped reduce memory usage but didn’t significantly improve the processing time.
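For reference, the kind of dtype optimization I tried looks roughly like this (the frame and column names here are placeholders standing in for my real data):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the real data comes from large_dataset.csv
df = pd.DataFrame({
    "column_name": np.random.uniform(size=1000),
    "another_column": np.random.choice(["a", "b"], size=1000),
})

# Downcast floats and use category for low-cardinality strings
df["column_name"] = df["column_name"].astype("float32")
df["another_column"] = df["another_column"].astype("category")

print(df.memory_usage(deep=True))
```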
What I Was Expecting: I expected that by analyzing memory usage and optimizing data types, along with loading data in chunks, I would see a noticeable improvement in the speed of my filtering and aggregation operations. However, the performance remains suboptimal, especially with the large size of the dataset.
It's difficult to say exactly what the bottleneck is without access to the actual file, the columns and their types (e.g. dtypes), and memory-usage statistics and profiles. But we can at least consider some options.
First, let's create a dummy dataset to illustrate some key points:
import numpy as np
import pandas as pd
import duckdb as db
N = 1_000_000
data = pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N),
})
data.to_csv("blob.csv")  # File is about 45 MB
data.memory_usage()  # Memory footprint is about 22 MB
So 1M rows with 3 columns of interest should be manageable. Selecting only the columns you actually need will keep the memory footprint at an acceptable level.
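You can also keep the footprint down at load time by telling `pd.read_csv` which columns and dtypes to use. A sketch, recreating a small version of the dummy CSV so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Recreate a small version of the dummy CSV from above
N = 10_000
pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N),
}).to_csv("blob.csv", index=False)

# Read back only the columns of interest, with cheap dtypes chosen up front
df = pd.read_csv(
    "blob.csv",
    usecols=["c", "x", "y"],
    dtype={"c": "category", "x": "float32", "y": "float32"},
)
print(df.dtypes)
```

This avoids ever materialising unneeded columns or oversized dtypes, instead of converting them after the fact.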
Now we will perform the aggregation as you suggested:
%%timeit -r 10 -n 10
data[data["x"] > 0.5].groupby("c")["y"].mean()
# 137 ms ± 11.5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
Notice the column selection ["y"] before the aggregation: it prevents the mean from being applied to every column left after the group by. Selecting a single column reduces the size of the result and the amount of computation.
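If the full frame genuinely does not fit in memory, the chunked read you tried can still work, but the aggregation must be combined across chunks (per-group sums and counts, then divide) rather than averaging each chunk independently. A sketch, using a small dummy CSV in place of your large file:

```python
import numpy as np
import pandas as pd

# Small dummy CSV standing in for the large file
N = 10_000
pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N),
}).to_csv("blob.csv", index=False)

parts = []
for chunk in pd.read_csv("blob.csv", chunksize=2_000):
    sub = chunk[chunk["x"] > 0.5]
    # Keep per-group sums and counts so the chunks can be merged exactly
    parts.append(sub.groupby("c")["y"].agg(["sum", "count"]))

combined = pd.concat(parts).groupby(level=0).sum()
means = combined["sum"] / combined["count"]
print(means)
```

The key point is that a mean of per-chunk means would be wrong whenever group sizes differ between chunks; sums and counts merge exactly.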
There is a nice package called duckdb that lets you run such operations on large datasets while keeping the memory footprint low compared to pandas.
The equivalent query can then be stated in SQL directly from the CSV file:
%%timeit -r 10 -n 10
db.query("SELECT c, MEAN(y) FROM 'blob.csv' WHERE x > 0.5 GROUP BY c;")
# 89.3 ms ± 1.01 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
Don't hesitate to give us more grain to grind (reference dataset, link to the file, a file sample, dtypes, actual statistics and profiles) so we can reproduce your issue.