I have a CSV file. I want to read most of its values as strings, but I want to read one column as bool if a column with the given title exists.
Because the CSV file has a lot of columns, I don't want to specify the datatype for each column explicitly, like this:
data = read_csv('sample.csv', dtype={'A': str, 'B': str, ..., 'X': bool})
Is it possible to read every column as string except one, and read that optional column as bool at the same time?
My current solution is the following (but it's very inefficient and slow):
data = read_csv('sample.csv', dtype=str)  # reads all columns as string
if 'X' in data.columns:
    l = lambda row: True if row['X'] == 'True' else False if row['X'] == 'False' else None
    data['X'] = data.apply(l, axis=1)
UPDATE: Sample CSV:
A;B;C;X
a1;b1;c1;True
a2;b2;c2;False
a3;b3;c3;True
Or the same file can appear without the 'X' column (because the column is optional):
A;B;C
a1;b1;c1
a2;b2;c2
a3;b3;c3
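For illustration, a minimal sketch of one possible approach, assuming the 'sample.csv' name and the ';' separator shown above: read only the header first, then build the dtype mapping from it (this assumes read_csv accepts bool in dtype for the 'True'/'False' strings):

import pandas as pd

# Read only the header row to see which columns exist (sep=';' as in the sample CSV).
header = pd.read_csv('sample.csv', sep=';', nrows=0).columns

# Every column as str; the optional 'X' column as bool if it is present.
dtypes = {col: str for col in header}
if 'X' in dtypes:
    dtypes['X'] = bool

data = pd.read_csv('sample.csv', sep=';', dtype=dtypes)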
If low_memory=False, then whole columns will be read in first and the proper types determined afterwards. For example, a column will be kept as object (strings) as needed to preserve information. If low_memory=True (the default), then pandas reads in the data in chunks of rows and then appends them together.
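A minimal illustration of that parameter, assuming the same 'sample.csv' with the ';' separator:

import pandas as pd

# With low_memory=False the whole column is read before its type is inferred,
# so the guess is not made chunk by chunk.
data = pd.read_csv('sample.csv', sep=';', low_memory=False)
print(data.dtypes)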
You can first filter for columns whose names contain X with boolean indexing, and then replace:
cols = df.columns[df.columns.str.contains('X')]
df[cols] = df[cols].replace({'True': True, 'False': False})
Or, if you need to filter the column X exactly:
cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})
Sample:
import pandas as pd
df = pd.DataFrame({'A': ['a1','a2','a3'],
                   'B': ['b1','b2','b3'],
                   'C': ['c1','c2','c3'],
                   'X': ['True','False','True']})
print (df)
A B C X
0 a1 b1 c1 True
1 a2 b2 c2 False
2 a3 b3 c3 True
print (df.dtypes)
A object
B object
C object
X object
dtype: object
cols = df.columns[df.columns.str.contains('X')]
print (cols)
Index(['X'], dtype='object')
df[cols] = df[cols].replace({'True': True, 'False': False})
print (df.dtypes)
A object
B object
C object
X bool
dtype: object
print (df)
A B C X
0 a1 b1 c1 True
1 a2 b2 c2 False
2 a3 b3 c3 True
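Applied to the original file, this could look like the following sketch (assuming the 'sample.csv' name and the ';' separator from the question):

import pandas as pd

# Read everything as string first (sep=';' as in the sample CSV).
df = pd.read_csv('sample.csv', sep=';', dtype=str)

# Convert the optional 'X' column to bool only if it is present.
cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})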
Why not use the bool() type? bool() evaluates to True if the parameter passed is truthy, i.e. not False, None, '', or 0. Note that the string 'False' is a non-empty string and therefore truthy, which is why it has to be replaced with a falsy value first:
if 'X' in data.columns:
    # replace the textual 'False' with an empty string (falsy), then let bool() handle the rest
    data['X'] = data['X'].replace('False', '').map(bool)
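The replacement step matters because bool() on any non-empty string is True:

print(bool('False'))  # True - non-empty string
print(bool(''))       # False - empty string is falsy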