 

How to efficiently handle large datasets in Python using Pandas?

I am working with a large dataset (approximately 1 million rows) in Python using the Pandas library, and I am experiencing performance issues when performing operations such as filtering and aggregating data.

Here is a simplified version of my code:

import pandas as pd

# Load the dataset
df = pd.read_csv('large_dataset.csv')

# Example operation: Filtering and aggregating
result = df[df['column_name'] > threshold_value].groupby('another_column').mean()

I've tried using df.memory_usage(deep=True) to analyze memory usage and pd.read_csv() with the chunksize parameter to load the data in chunks, but I still face slow performance.

**What are some best practices for optimizing data processing with Pandas for large datasets?** Any suggestions on techniques, alternative libraries, or specific functions that could help improve performance would be greatly appreciated!

What I Tried:

  1. Memory Analysis: I used df.memory_usage(deep=True) to understand memory consumption and found that certain columns were using a lot of memory due to their data types.

  2. Loading Data in Chunks: I attempted to load the dataset in chunks using the chunksize parameter with pd.read_csv(). This allowed me to work with smaller parts of the dataset, but my filtering and aggregation operations remained slow.

  3. Data Type Optimization: I experimented with changing the data types of certain columns to more memory-efficient types (e.g., converting float64 to float32), which helped reduce memory usage but didn't significantly improve the processing time.

What I Was Expecting: I expected that by analyzing memory usage and optimizing data types, along with loading data in chunks, I would see a noticeable improvement in the speed of my filtering and aggregation operations. However, the performance remains suboptimal, especially with the large size of the dataset.
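One subtlety with the chunked approach above is that a mean cannot simply be averaged across chunks; you have to accumulate per-group sums and counts and divide once at the end. A minimal sketch, reusing the column names from the snippet at the top (the synthetic CSV and the `value` column are hypothetical stand-ins for the real file):

```python
import numpy as np
import pandas as pd

# Build a small synthetic CSV so the example is self-contained
# (a stand-in for the real large_dataset.csv).
rng = np.random.default_rng(0)
pd.DataFrame({
    "column_name": rng.uniform(size=10_000),
    "another_column": rng.choice(["a", "b"], size=10_000),
    "value": rng.normal(size=10_000),
}).to_csv("large_dataset_demo.csv", index=False)

threshold_value = 0.5

# Accumulate per-group sums and counts across chunks; a mean of
# chunk means would be wrong when chunks have unequal group sizes.
sums, counts = None, None
for chunk in pd.read_csv("large_dataset_demo.csv", chunksize=2_500):
    filtered = chunk[chunk["column_name"] > threshold_value]
    g = filtered.groupby("another_column")["value"]
    s, c = g.sum(), g.count()
    sums = s if sums is None else sums.add(s, fill_value=0)
    counts = c if counts is None else counts.add(c, fill_value=0)

result = sums / counts  # per-group mean over the full dataset
print(result)
```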

asked Feb 04 '26 by hotnoob

1 Answer

It's difficult to say exactly what the bottleneck is without access to the actual file, the columns and their types (e.g. dtypes), and memory-usage statistics or profiles. But we can at least consider some options.

Toy dataset with 1M rows

First, let's create a dummy dataset to illustrate some key points:

import numpy as np
import pandas as pd
import duckdb as db

N = 1_000_000

data = pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N)
})

data.to_csv("blob.csv")  # File is about 45 MB
data.memory_usage()      # Memory footprint is about 22 MB

Therefore 1M rows with 3 columns of interest should be manageable. Selecting only the columns you actually need will reduce the memory footprint to an acceptable level.
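The column selection can be pushed into the read itself: pd.read_csv accepts usecols and dtype, so unneeded columns are never materialized and the low-cardinality column can be categorical from the start. A sketch against the toy blob.csv (the specific dtype choices are illustrative):

```python
import numpy as np
import pandas as pd

# Recreate the toy CSV from above so this snippet runs standalone.
N = 1_000_000
pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N),
}).to_csv("blob.csv", index=False)

# Load only the needed columns, with compact dtypes chosen up front:
# "c" has 3 distinct values, so a categorical dtype stores integer
# codes instead of one Python string object per row.
df = pd.read_csv(
    "blob.csv",
    usecols=["c", "x", "y"],
    dtype={"c": "category", "x": "float32", "y": "float32"},
)
print(df.memory_usage(deep=True).sum())
```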

Explicitly select columns you want to aggregate

Now we will perform the aggregation as you suggested:

%%timeit -r 10 -n 10
data[data["x"] > 0.5].groupby("c")["y"].mean()
# 137 ms ± 11.5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Notice the indexing with ["y"] before the aggregation: it restricts the mean to the single column of interest instead of applying it to every column left over from the group-by. This reduces both the size of the result and the amount of computation.
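On top of the column selection, converting the grouping key to a categorical dtype can speed the group-by up further, since grouping then works on integer codes rather than string comparisons. A sketch on the same toy data (observed=True skips unused categories; timings will vary by machine):

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt here so the snippet is standalone.
N = 1_000_000
data = pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N),
})

# Categorical grouping key: groupby operates on the integer codes.
data["c"] = data["c"].astype("category")
result = data.loc[data["x"] > 0.5].groupby("c", observed=True)["y"].mean()
print(result)
```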

Alternative to pandas

There is a nice package called duckdb that can perform such operations on large datasets while keeping a smaller footprint than pandas.

The equivalent query can then be stated in SQL directly from the CSV file:

%%timeit -r 10 -n 10
db.query("SELECT c, MEAN(y) FROM 'blob.csv' WHERE x > 0.5 GROUP BY c;")
# 89.3 ms ± 1.01 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Going further

Don't hesitate to give us more grain to grind (a reference dataset, a link to the file or a sample of it, dtypes, actual statistics and profiles) so that we can reproduce your issue.

answered Feb 06 '26 by jlandercy