
Pandas using too much memory with read_sql_table

I am trying to read a table from my Postgres database into Python. The table has around 8 million rows and 17 columns, and is about 622 MB in the database.

I can export the entire table to csv using psql and then read it in with pd.read_csv(). That works perfectly fine: the Python process only uses around 1 GB of memory and everything is good.
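Roughly, that manual route looks like this (the exact \copy command and file name here are just illustrative):

# in psql (illustrative): \copy schemaname.table_name TO 'table.csv' WITH CSV HEADER
import pandas as pd
the_frame = pd.read_csv("table.csv")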

Now, the task we need to do requires this pull to be automated, so I thought I could read the table in using pd.read_sql_table() directly from the DB, using the following code:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")
the_frame = pd.read_sql_table(table_name='table_name', con=engine, schema='schemaname')

This approach starts using a lot of memory. When I track the memory usage in Task Manager, I can see the Python process's memory climb and climb until it hits 16 GB and freezes the computer.

Any ideas on why this might be happening would be appreciated.

asked Dec 21 '16 by user4505419

People also ask

How do I reduce pandas memory usage?

Change numeric columns to a smaller dtype: instead of keeping the defaults, we can downcast the data types, e.g. store int64 values as int8 and float64 as float32 where the values fit. This will reduce memory usage.
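A minimal sketch of that downcasting idea (the column names are hypothetical):

import pandas as pd

# Hypothetical frame that pandas would store as int64/float64 by default.
df = pd.DataFrame({"counts": [1, 2, 3], "prices": [1.5, 2.25, 3.75]})

# Downcast each numeric column to the smallest dtype that can hold its values.
df["counts"] = pd.to_numeric(df["counts"], downcast="integer")  # e.g. int8
df["prices"] = pd.to_numeric(df["prices"], downcast="float")    # e.g. float32

print(df.memory_usage(deep=True))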

How many GB can Pandas handle?

The upper limit for a pandas DataFrame was 100 GB of free disk space on the machine. When your Mac needs memory, it will push something that isn't currently being used into a swap file for temporary storage.

Is pandas to_sql fast?

to_sql seems to send an INSERT query for every row, which makes it really slow. But since 0.24.0 there is a method parameter in pandas.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.
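A hedged sketch of the method='multi' path (the connection string, table name, and data are placeholders, not values from the question):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# method='multi' batches many rows into each INSERT statement;
# chunksize bounds how many rows go into one statement.
df.to_sql("target_table", con=engine, if_exists="append",
          index=False, method="multi", chunksize=1000)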

Is pandas memory intensive?

pandas is a popular Python package for data science, as it offers powerful, expressive, and flexible data structures for data exploration and visualization. But when it comes to handling large datasets it falls short, because it cannot process data that is larger than memory.


1 Answer

You need to set the chunksize argument so that pandas iterates over smaller chunks of data instead of loading the whole table into memory at once; a short sketch follows below. See this post: https://stackoverflow.com/a/31839639/3707607
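A minimal sketch, reusing the placeholders from the question. Note that if you concatenate every chunk back together you still end up with the whole table in memory, so ideally process or downcast each chunk as it arrives:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")

# With chunksize set, read_sql_table returns an iterator of DataFrames
# instead of materialising the entire table at once.
chunks = pd.read_sql_table(table_name='table_name', con=engine,
                           schema='schemaname', chunksize=50000)

frames = []
for chunk in chunks:
    # Process, filter, or downcast each chunk here before keeping it.
    frames.append(chunk)

the_frame = pd.concat(frames, ignore_index=True)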

answered Oct 10 '22 by Ted Petrou