Specifying dtypes for read_sql in pandas

I would like to specify the dtypes returned when doing pandas.read_sql. In particular, I am interested in saving memory by having float values returned as np.float32 instead of np.float64. I know that I can convert afterwards with astype(np.float32), but that doesn't solve the problem of the large memory requirements of the initial query. In my actual code I will be pulling 84 million rows, not the 5 shown here. pandas.read_csv allows specifying dtypes as a dict, but I see no way to do that with read_sql.

I am using MySQLdb and Python 2.7.

As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.

In [70]: df=pd.read_sql('select ARP, ACP from train where seq < 5', connection)

In [71]: df
Out[71]: 
   ARP      ACP
0  1.17915  1.42595
1  1.10578  1.21369
2  1.35629  1.12693
3  1.56740  1.61847
4  1.28060  1.05935


In [72]: df.dtypes
Out[72]: 
ARP    float64
ACP    float64
dtype: object
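
For scale, 84 million rows of two float columns work out to roughly:

# 84e6 rows x 2 columns, 8 bytes per float64 vs 4 bytes per float32
print(84e6 * 2 * 8 / 2**30)  # ~1.25 GiB as float64
print(84e6 * 2 * 4 / 2**30)  # ~0.63 GiB as float32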
asked Aug 17 '16 by SolverWorld


2 Answers

You can use pandas.read_sql_query, which lets you specify the returned dtypes through its dtype argument (supported since pandas 1.3):

import numpy as np
import pandas as pd

# 'connection' is the open MySQLdb connection from the question.
df = pd.read_sql_query('select ARP, ACP from train where seq < 5', connection,
                       dtype={'ARP': np.float32, 'ACP': np.float32})
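
The dtype argument only exists in pandas 1.3+, which the Python 2.7 setup in the question cannot run. A rough fallback, sketched here with read_sql's chunksize parameter, is to downcast each chunk as it arrives so the full float64 result never has to exist at once:

import numpy as np
import pandas as pd

# chunksize makes read_sql return an iterator of DataFrames.
chunks = pd.read_sql('select ARP, ACP from train where seq < 5', connection,
                     chunksize=1000000)
# Downcast each chunk before stitching them together; peak memory stays
# near one float64 chunk plus the growing float32 result.
df = pd.concat((chunk.astype(np.float32) for chunk in chunks),
               ignore_index=True)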

answered Sep 16 '22 by Ohad Bruker


As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.

Maybe you can try our tool ConnectorX (pip install -U connectorx), which is implemented in Rust, aims to improve pandas.read_sql in both speed and memory usage, and provides a similar interface. To switch to it, you only need to:

import connectorx as cx

# Replace the credentials, host, port, and database with your own.
conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train where seq < 5"
df = cx.read_sql(conn_url, query)
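
For a table as large as the 84 million rows in the question, ConnectorX can also fetch in parallel. A sketch, assuming the numeric seq column from the question is suitable for range partitioning (the partition column must appear in the query):

# Split the query on 'seq' and fetch the partitions concurrently;
# tune partition_num to the number of cores/connections available.
df = cx.read_sql(conn_url, "select ARP, ACP, seq from train",
                 partition_on="seq", partition_num=8)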

The reason pandas.read_sql uses a lot of memory while running is its large intermediate Python objects; ConnectorX uses Rust and stream processing to tackle this problem.

Here are some benchmark results:

  • PostgreSQL: [memory usage chart]

  • MySQL: [memory usage chart]
answered Sep 20 '22 by Xiaoying Wang