I would like to specify the dtypes returned when doing pandas.read_sql. In particular I am interested in saving memory and having float values returned as np.float32 instead of np.float64. I know that I can convert afterwards with astype(np.float32) but that doesn't solve the problem of the large memory requirements in the initial query. In my actual code, I will be pulling 84 million rows, not the 5 shown here. pandas.read_csv allows for specifying dtypes as a dict, but I see no way to do that with read_sql.
I am using MySQLdb and Python 2.7.
As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.
In [70]: df=pd.read_sql('select ARP, ACP from train where seq < 5', connection)
In [71]: df
Out[71]:
       ARP      ACP
0  1.17915  1.42595
1  1.10578  1.21369
2  1.35629  1.12693
3  1.56740  1.61847
4  1.28060  1.05935
In [72]: df.dtypes
Out[72]:
ARP float64
ACP float64
dtype: object
You can use pandas.read_sql_query, which lets you specify the returned dtypes via its dtype argument (supported only since pandas 1.3).
import numpy as np

df = pd.read_sql_query('select ARP, ACP from train where seq < 5', connection,
                       dtype={'ARP': np.float32, 'ACP': np.float32})
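If you are stuck on an older pandas (the question mentions Python 2.7, where the dtype argument is not available), one possible workaround is to read the result in chunks and downcast each chunk with astype as it arrives, so the full float64 result never has to sit in memory at once. A minimal sketch, assuming the same train table and connection object:

import numpy as np
import pandas as pd

chunks = []
# chunksize makes read_sql_query yield DataFrames of at most 500000 rows each
for chunk in pd.read_sql_query('select ARP, ACP from train', connection,
                               chunksize=500000):
    # downcast each chunk before keeping it, so only one float64 chunk
    # is alive at any time
    chunks.append(chunk.astype(np.float32))

df = pd.concat(chunks, ignore_index=True)

The peak memory is then roughly one float64 chunk plus the accumulated float32 chunks, rather than the entire float64 result set.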
As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.
Maybe you can try our tool ConnectorX (pip install -U connectorx), which is implemented in Rust and aims to improve the performance of pandas.read_sql in terms of both time and memory usage, and provides a similar interface. To switch to it, you only need to:
import connectorx as cx

# connection string for the MySQL database (replace the credentials with your own)
conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train where seq < 5"

# read the query result directly into a pandas DataFrame
df = cx.read_sql(conn_url, query)
The reason pandas.read_sql uses a lot of memory while running is its large intermediate Python objects; in ConnectorX we use Rust and stream processing to tackle this problem.
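For a pull as large as the 84 million rows mentioned in the question, ConnectorX can also split the query across multiple connections and fetch the partitions in parallel. A minimal sketch, assuming the train table has a numeric seq column suitable for partitioning (the partition_on and partition_num arguments are part of connectorx.read_sql; the column choice here is just an illustration):

import connectorx as cx

conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train"

# split the query on the numeric seq column and fetch 4 partitions in parallel
df = cx.read_sql(conn_url, query, partition_on="seq", partition_num=4)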
Here are some benchmark results (the original answer included charts comparing time and memory usage against pandas.read_sql for PostgreSQL and MySQL).