I have a CSV file. I want to read most of its values as strings, but I want to read one column as bool if a column with the given title exists.
Because the CSV file has a lot of columns, I don't want to specify the datatype for each column explicitly, like this:
data = read_csv('sample.csv', dtype={'A': str, 'B': str, ..., 'X': bool})
Is it possible to read every column as string except one, and read that optional column as bool at the same time?
My current solution is the following (but it's very inefficient and slow):
data = read_csv('sample.csv', dtype=str)  # reads all columns as string
if 'X' in data.columns:
    l = lambda row: True if row['X'] == 'True' else False if row['X'] == 'False' else None
    data['X'] = data.apply(l, axis=1)
UPDATE: Sample CSV:
A;B;C;X
a1;b1;c1;True
a2;b2;c2;False
a3;b3;c3;True
Or the same file can appear without the 'X' column (because the column is optional):
A;B;C
a1;b1;c1
a2;b2;c2
a3;b3;c3
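For illustration, a minimal sketch of one possible approach, assuming the 'sample.csv' name and the ';' separator shown above: read only the header first, then build the dtype mapping from it (this assumes read_csv accepts bool in dtype for the 'True'/'False' strings):

import pandas as pd

# Read only the header row to see which columns exist (sep=';' as in the sample CSV).
header = pd.read_csv('sample.csv', sep=';', nrows=0).columns

# Every column as str; the optional 'X' column as bool if it is present.
dtypes = {col: str for col in header}
if 'X' in dtypes:
    dtypes['X'] = bool

data = pd.read_csv('sample.csv', sep=';', dtype=dtypes)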
If low_memory=False, then whole columns will be read in first and the proper types determined afterwards. For example, a column will be kept as object (strings) as needed to preserve information. If low_memory=True (the default), then pandas reads in the data in chunks of rows and then appends them together.
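A minimal illustration of that parameter, assuming the same 'sample.csv' with the ';' separator:

import pandas as pd

# With low_memory=False the whole column is read before its type is inferred,
# so the guess is not made chunk by chunk.
data = pd.read_csv('sample.csv', sep=';', low_memory=False)
print(data.dtypes)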
You can first filter for columns whose names contain X with boolean indexing, and then replace:
cols = df.columns[df.columns.str.contains('X')]
df[cols] = df[cols].replace({'True': True, 'False': False})
Or, if you need to filter the column X exactly:
cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})
Sample:
import pandas as pd
df = pd.DataFrame({'A': ['a1','a2','a3'],
                   'B': ['b1','b2','b3'],
                   'C': ['c1','c2','c3'],
                   'X': ['True','False','True']})
print (df)
A B C X
0 a1 b1 c1 True
1 a2 b2 c2 False
2 a3 b3 c3 True
print (df.dtypes)
A object
B object
C object
X object
dtype: object
cols = df.columns[df.columns.str.contains('X')]
print (cols)
Index(['X'], dtype='object')
df[cols] = df[cols].replace({'True': True, 'False': False})
print (df.dtypes)
A object
B object
C object
X bool
dtype: object
print (df)
A B C X
0 a1 b1 c1 True
1 a2 b2 c2 False
2 a3 b3 c3 True
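Applied to the original file, this could look like the following sketch (assuming the 'sample.csv' name and the ';' separator from the question):

import pandas as pd

# Read everything as string first (sep=';' as in the sample CSV).
df = pd.read_csv('sample.csv', sep=';', dtype=str)

# Convert the optional 'X' column to bool only if it is present.
cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})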
Why not use the bool() type? bool() evaluates to True if the parameter passed is truthy, i.e. not False, None, '', or 0. Note that the string 'False' is a non-empty string and therefore truthy, which is why it has to be replaced with a falsy value first:
if 'X' in data.columns:
    # replace the textual 'False' with an empty string (falsy), then let bool() handle the rest
    data['X'] = data['X'].replace('False', '').map(bool)
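The replacement step matters because bool() on any non-empty string is True:

print(bool('False'))  # True - non-empty string
print(bool(''))       # False - empty string is falsy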