Specifying dtypes for read_sql in pandas

I would like to specify the dtypes returned when doing pandas.read_sql. In particular, I am interested in saving memory by having float values returned as np.float32 instead of np.float64. I know that I can convert afterwards with astype(np.float32), but that doesn't solve the problem of the large memory requirements of the initial query. In my actual code I will be pulling 84 million rows, not the 5 shown here. pandas.read_csv allows specifying dtypes as a dict, but I see no way to do that with read_sql.

I am using MySQLdb and Python 2.7.

As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.

In [70]: df=pd.read_sql('select ARP, ACP from train where seq < 5', connection)

In [71]: df
Out[71]: 
   ARP      ACP
0  1.17915  1.42595
1  1.10578  1.21369
2  1.35629  1.12693
3  1.56740  1.61847
4  1.28060  1.05935


In [72]: df.dtypes
Out[72]: 
ARP    float64
ACP    float64
dtype: object
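
For scale, 84 million rows of two float columns work out to roughly:

# 84e6 rows x 2 columns, 8 bytes per float64 vs 4 bytes per float32
print(84e6 * 2 * 8 / 2**30)  # ~1.25 GiB as float64
print(84e6 * 2 * 4 / 2**30)  # ~0.63 GiB as float32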
asked Aug 17 '16 by SolverWorld


2 Answers

You can use pandas.read_sql_query, which lets you specify the returned dtypes through its dtype argument (supported since pandas 1.3):

import numpy as np
import pandas as pd

# 'connection' is the open MySQLdb connection from the question.
df = pd.read_sql_query('select ARP, ACP from train where seq < 5', connection,
                       dtype={'ARP': np.float32, 'ACP': np.float32})
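
The dtype argument only exists in pandas 1.3+, which the Python 2.7 setup in the question cannot run. A rough fallback, sketched here with read_sql's chunksize parameter, is to downcast each chunk as it arrives so the full float64 result never has to exist at once:

import numpy as np
import pandas as pd

# chunksize makes read_sql return an iterator of DataFrames.
chunks = pd.read_sql('select ARP, ACP from train where seq < 5', connection,
                     chunksize=1000000)
# Downcast each chunk before stitching them together; peak memory stays
# near one float64 chunk plus the growing float32 result.
df = pd.concat((chunk.astype(np.float32) for chunk in chunks),
               ignore_index=True)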

answered Sep 16 '22 by Ohad Bruker


As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.

Maybe you can try our tool ConnectorX (pip install -U connectorx), which is implemented in Rust, aims to improve pandas.read_sql in both speed and memory usage, and provides a similar interface. To switch to it, you only need to:

import connectorx as cx

# Replace the credentials, host, port, and database with your own.
conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train where seq < 5"
df = cx.read_sql(conn_url, query)
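
For a table as large as the 84 million rows in the question, ConnectorX can also fetch in parallel. A sketch, assuming the numeric seq column from the question is suitable for range partitioning (the partition column must appear in the query):

# Split the query on 'seq' and fetch the partitions concurrently;
# tune partition_num to the number of cores/connections available.
df = cx.read_sql(conn_url, "select ARP, ACP, seq from train",
                 partition_on="seq", partition_num=8)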

The reason pandas.read_sql uses a lot of memory while running is its large intermediate Python objects; ConnectorX uses Rust and stream processing to tackle this problem.

Here are some benchmark results:

  • PostgreSQL: [memory usage chart]

  • MySQL: [memory usage chart]
answered Sep 20 '22 by Xiaoying Wang