pandas: read_csv how to force bool data to dtype bool instead of object

Tags:

python

pandas

I'm reading a large flat file of timestamped data with multiple columns. One of the columns is boolean: each entry is either true/false or missing (a missing entry is read as NaN).

When the CSV is read, the boolean column is typed as object, which prevents saving the data to an HDFStore because of a serialization error.

example data:

A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4

I use the following command to read the file:

import pandas as pd
pd.read_csv('data.csv', parse_dates=True)
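
The behaviour can be reproduced without the file. The sketch below is a stand-in, not the original setup: it assumes the sample rows are whitespace-separated and feeds them through io.StringIO instead of data.csv.

```python
import io
import pandas as pd

# Stand-in for data.csv: the sample rows, assumed to be whitespace-separated.
sample = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# D mixes True/False values with a NaN for the missing entry,
# so pandas stores the column as object rather than bool.
print(df.dtypes)
print(df["D"].tolist())
```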

One solution is to specify the dtype while reading the CSV, but I was hoping for a more succinct solution, like convert_objects, where I can specify parse_numeric or parse_dates.

Prasanjit Prakash asked Apr 20 '15 05:04


1 Answer

You can use the dtype parameter; it accepts a dictionary mapping column names to types:

dtype : Type name or dict of column -> type
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
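
import io
import pandas as pd

# using your sample (io.StringIO, since read_csv expects text on Python 3)
csv_file = io.StringIO('''
A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4''')

# np.bool has been removed from NumPy, and recent pandas refuses dtype bool
# for a column with missing entries, so read D as the nullable 'boolean' dtype
df = pd.read_csv(csv_file, sep=r'\s+', dtype={'D': 'boolean'})
# then fill the missing entry with False and cast to a plain NumPy bool
df['D'] = df['D'].fillna(False).astype(bool)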

df 
   A  B  C      D
0  a  1  2   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False

df.D.dtypes
dtype('bool')
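
For the more succinct, convert_objects-style route the question asks about, one option in current pandas is DataFrame.convert_dtypes, which re-infers nullable dtypes after the read. The following is a sketch rather than part of the answer above: the variable names are illustrative, the sample is assumed to be whitespace-separated, and it assumes pandas 1.0 or later.

```python
import io
import pandas as pd

sample = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
"""

# Read with no dtype hint; D arrives as object because of the missing entry.
raw = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# convert_dtypes re-infers every column; D becomes the nullable 'boolean' dtype
# and the missing entry is kept as <NA> instead of being forced to False.
converted = raw.convert_dtypes()
print(converted.dtypes)

# If a plain NumPy bool is needed, fill the gap explicitly.
as_bool = converted["D"].fillna(False).astype(bool)
print(as_bool.dtype)
```

Note that convert_dtypes touches every column, not just D, so naming the column in dtype remains the more targeted choice when only one column needs fixing.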

Anzel answered Oct 08 '22 08:10