What's the difference between dtype and converters in pandas.read_csv?

Tags: python, types, pandas, converter, type-inference

The pandas function read_csv() reads a .csv file; it is described in the pandas documentation.

According to the documentation:

dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')

and

converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). The name converters suggests that the data type will be converted, but then what does dtype do?

238

asked Dec 07 '15 17:12

Bryan

2 Answers

The semantic difference is that dtype allows you to specify how to treat the values, for example, either as numeric or string type.

converters lets you parse your input data and convert it to a desired dtype using a conversion function, e.g., parsing a string value to datetime or to some other desired dtype.
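To make the distinction concrete, here is a minimal sketch. The CSV text and the column names code and price are made up for illustration and are not from the question. dtype only declares how a column should be typed, while a converter is a function that receives each raw string and returns the value to store.

```python
import io
import pandas as pd

# Hypothetical two-column CSV, used only for illustration.
csv_text = "code,price\n007,$1.50\n042,$20.00\n"

# dtype: declare that the "code" column should be kept as text.
by_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# converters: run a function on each raw string in "price"
# and store whatever the function returns.
by_converter = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"code": str},
    converters={"price": lambda s: float(s.lstrip("$"))},
)

print(by_dtype["code"].tolist())       # leading zeros survive as text
print(by_dtype["price"].tolist())      # still raw strings such as "$1.50"
print(by_converter["price"].tolist())  # floats produced by the converter
```

No dtype could turn "$1.50" into a number on its own; that is the kind of case a converter is for.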

Here we see that pandas tries to sniff the types:

    In [1]: import io
       ...: import pandas as pd

    In [2]: t = """int,float,date,str
       ...: 001,3.31,2015/01/01,005"""
       ...: df = pd.read_csv(io.StringIO(t))
       ...: df.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null object
    str      1 non-null int64
    dtypes: float64(1), int64(2), object(1)
    memory usage: 40.0+ bytes

You can see from the above that 001 and 005 are parsed as int64 (so their leading zeros are lost), but the date string stays as object, i.e. str.

If we say everything is object then essentially everything is str:

    In [3]: pd.read_csv(io.StringIO(t), dtype=object).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null object
    date     1 non-null object
    str      1 non-null object
    dtypes: object(4)
    memory usage: 40.0+ bytes
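A quick way to check the "everything is str" claim is to look at individual cells rather than at info(). This sketch reuses the same one-row CSV; the variable names inferred and as_text are only illustrative.

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

# Let pandas infer the column types.
inferred = pd.read_csv(io.StringIO(t))

# Ask for every column to be kept as object.
as_text = pd.read_csv(io.StringIO(t), dtype=object)

print(repr(inferred.loc[0, "int"]))   # a number; the leading zeros are gone
print(repr(as_text.loc[0, "int"]))    # the original text, zeros included
print(type(as_text.loc[0, "float"]))  # str, not float
```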

Here we force the int column to str and use parse_dates so that the date column is parsed with the default date parser:

    In [6]: pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
    memory usage: 40.0+ bytes

Similarly, we could have passed the to_datetime function as a converter to convert the dates:

    In [5]: pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(2)
    memory usage: 40.0 bytes
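To see what a converter actually receives, the following sketch wraps pd.to_datetime in a small helper that records its input. The helper parse_date and the list seen are hypothetical names introduced here, and the explicit format string is an assumption based on the sample data.

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

seen = []  # raw inputs handed to the converter


def parse_date(raw):
    # Hypothetical helper: remember the raw cell text, then parse it.
    seen.append(raw)
    return pd.to_datetime(raw, format="%Y/%m/%d")


df = pd.read_csv(io.StringIO(t), converters={"date": parse_date})

print(seen)               # the converter is given the cell as plain text
print(df.loc[0, "date"])  # the value returned by the converter
```

The pandas documentation for dtype notes that when converters are specified they are applied instead of dtype conversion, so for any single column you choose one or the other.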

163

answered Oct 04 '22 10:10

EdChum

I would say that the main purpose for converters is to manipulate the values of the column, not the datatype. The answer shared by @EdChum focuses on the idea of the dtypes. It uses the pd.to_datetime function.

In this article, https://medium.com/analytics-vidhya/make-the-most-out-of-your-pandas-read-csv-1531c71893b5, the section on converters has an example that takes a csv column with values such as "185 lbs." and removes the "lbs." from the text. This is closer to the idea behind the read_csv converters parameter.

What the .csv looks like (the image is in the article): the csv file has 6 columns, and Weight is a column with entries like "145 lbs."

    import pandas as pd

    # Create functions to clean the columns
    w = lambda x: x.replace('lbs.', '')
    r = lambda x: x.replace('"', '')

    # Use converters to apply the functions to the columns
    fighter = pd.read_csv('raw_fighter_details.csv',
                          converters={'Weight': w, 'Reach': r},
                          header=0,
                          usecols=[0, 1, 2, 3])
    fighter.head(15)

[Image: the DataFrame after using converters on the Weight column.]
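Since the article's file is not available here, this is a self-contained sketch of the same idea using an in-memory CSV. The rows, the fighter_name column and the helper strip_lbs are invented for illustration. Unlike the lambda above, which only removes the text, this version also converts the remainder to int, which is an extra step I am assuming is wanted.

```python
import io
import pandas as pd

# Hypothetical stand-in for raw_fighter_details.csv (made-up rows).
csv_text = "fighter_name,Weight\nFighter A,185 lbs.\nFighter B,145 lbs.\n"


def strip_lbs(raw):
    # Drop the unit suffix and surrounding whitespace, then convert to int.
    return int(raw.replace("lbs.", "").strip())


fighters = pd.read_csv(io.StringIO(csv_text), converters={"Weight": strip_lbs})

print(fighters["Weight"].tolist())
print(fighters["Weight"].sum())  # arithmetic works once the unit is gone
```

A real file may contain blank Weight cells; as far as I know the converter would then be handed an empty string, so a production version would need to handle that case before calling int().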

answered Oct 04 '22 08:10

VISQL
