
How to force pandas read_csv to use float32 for all float columns?

Because

  • I don't need double precision
  • My machine has limited memory and I want to process bigger datasets
  • I need to pass the extracted data (as a matrix) to BLAS libraries, and BLAS calls for single precision are about 2x faster than their double-precision equivalents.

Note that not all columns in the raw csv file have float types. I only need to set float32 as the default for float columns.
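To see why this matters for memory, here is a small sketch (the column name and sizes are made up) showing that a float32 column takes exactly half the space of a float64 one:

```python
import numpy as np
import pandas as pd

# 1000 float64 values take 8 bytes each; float32 takes 4 bytes each.
df64 = pd.DataFrame({"x": np.zeros(1000, dtype=np.float64)})
df32 = df64.astype({"x": np.float32})

print(df64["x"].memory_usage(index=False))  # 8000 bytes
print(df32["x"].memory_usage(index=False))  # 4000 bytes
```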

Fabian asked May 27 '15

2 Answers

Try:

import numpy as np
import pandas as pd

# Sample 100 rows of data to determine dtypes.
df_test = pd.read_csv(filename, nrows=100)

float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

df = pd.read_csv(filename, engine='c', dtype=float32_cols)

This first reads a sample of 100 rows of data (modify as required) to determine the type of each column.

It then creates a list of the columns whose dtype is 'float64', and uses a dictionary comprehension to build a dictionary with these columns as the keys and 'np.float32' as the value for each key.

Finally, it reads the whole file using the 'c' engine (required for assigning dtypes to columns) and then passes the float32_cols dictionary as a parameter to dtype.

>>> df = pd.read_csv(filename, nrows=100)
>>> df
   int_col  float1 string_col  float2
0        1     1.2          a     2.2
1        2     1.3          b     3.3
2        3     1.4          c     4.4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float64
string_col    3 non-null object
float2        3 non-null float64
dtypes: float64(2), int64(1), object(1)

>>> df32 = pd.read_csv(filename, engine='c', dtype={c: np.float32 for c in float_cols})
>>> df32.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float32
string_col    3 non-null object
float2        3 non-null float32
dtypes: float32(2), int64(1), object(1)
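The same flow can be reproduced as a fully self-contained sketch by reading from an in-memory CSV (the csv_data string here stands in for the real file on disk):

```python
import io
import numpy as np
import pandas as pd

# In-memory stand-in for the CSV file; column names are illustrative.
csv_data = (
    "int_col,float1,string_col,float2\n"
    "1,1.2,a,2.2\n"
    "2,1.3,b,3.3\n"
    "3,1.4,c,4.4\n"
)

# Step 1: sample rows to detect which columns parse as float64.
df_test = pd.read_csv(io.StringIO(csv_data), nrows=100)
float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

# Step 2: read the full data with those columns forced to float32.
df = pd.read_csv(io.StringIO(csv_data), engine='c', dtype=float32_cols)
print(df.dtypes)
```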
Alexander answered Sep 22 '22


Here's a solution which does not depend on .join and does not require reading the file twice:

float64_cols = df.select_dtypes(include='float64').columns
mapper = {col_name: np.float32 for col_name in float64_cols}
df = df.astype(mapper)

Or for kicks as a one-liner:

df = df.astype({c: np.float32 for c in df.select_dtypes(include='float64').columns}) 
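As a quick sanity check of the one-liner on a toy frame (column names are made up), only the float64 column is converted, while int and object columns are left untouched:

```python
import numpy as np
import pandas as pd

# Toy frame with one int, one float64, and one object column.
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# Downcast only the float64 columns to float32.
df = df.astype({c: np.float32 for c in df.select_dtypes(include='float64').columns})
print(df.dtypes)
```

Note that unlike the accepted answer, this converts after the full file is already in memory as float64, so it does not reduce peak memory usage during the read itself.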
jorijnsmit answered Sep 26 '22