Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the difference between dtype and converters in pandas.read_csv?

pandas function read_csv() reads a .csv file. Its documentation is here

According to documentation, we know:

dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’)

and

converters : dict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels

When using this function, I can call either pandas.read_csv('file',dtype=object) or pandas.read_csv('file',converters=object). Obviously, converter, its name can says that data type will be converted but I wonder the case of dtype?

like image 238
Bryan Avatar asked Dec 07 '15 17:12

Bryan


People also ask

What is Dtype in read_csv?

There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats. Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.

What is a Dtype in pandas?

To check the data type in pandas DataFrame we can use the “dtype” attribute. The attribute returns a series with the data type of each column. And the column names of the DataFrame are represented as the index of the resultant series object and the corresponding data types are returned as values of the series object.

Is Dtype object same as string?

This means, if you say when a column is an Object dtype, and it doesn't mean all the values in that column will be a string or text data. In fact, they may be numbers, or a mixture of string, integers, and floats dtype. So with this incompatibility, we can not do any string operations on that column directly.

What output type does pandas read_csv () return?

Read a CSV File In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data. csv , which you specified with the first argument.


2 Answers

The semantic difference is that dtype allows you to specify how to treat the values, for example, either as numeric or string type.

Converters allows you to parse your input data to convert it to a desired dtype using a conversion function, e.g, parsing a string value to datetime or to some other desired dtype.

Here we see that pandas tries to sniff the types:

In [2]: df = pd.read_csv(io.StringIO(t)) t="""int,float,date,str 001,3.31,2015/01/01,005""" df = pd.read_csv(io.StringIO(t)) df.info()  <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int      1 non-null int64 float    1 non-null float64 date     1 non-null object str      1 non-null int64 dtypes: float64(1), int64(2), object(1) memory usage: 40.0+ bytes 

You can see from the above that 001 and 005 are treated as int64 but the date string stays as str.

If we say everything is object then essentially everything is str:

In [3]:     df = pd.read_csv(io.StringIO(t), dtype=object).info()  <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int      1 non-null object float    1 non-null object date     1 non-null object str      1 non-null object dtypes: object(4) memory usage: 40.0+ bytes 

Here we force the int column to str and tell parse_dates to use the date_parser to parse the date column:

In [6]: pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()  <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int      1 non-null object float    1 non-null float64 date     1 non-null datetime64[ns] str      1 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(1), object(1) memory usage: 40.0+ bytes 

Similarly we could've pass the to_datetime function to convert the dates:

In [5]: pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()  <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int      1 non-null int64 float    1 non-null float64 date     1 non-null datetime64[ns] str      1 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(2) memory usage: 40.0 bytes 
like image 163
EdChum Avatar answered Oct 04 '22 10:10

EdChum


I would say that the main purpose for converters is to manipulate the values of the column, not the datatype. The answer shared by @EdChum focuses on the idea of the dtypes. It uses the pd.to_datetime function.

Within this article https://medium.com/analytics-vidhya/make-the-most-out-of-your-pandas-read-csv-1531c71893b5 in the area about converters, you will see an example of changing a csv column, with values such as "185 lbs.", into something that removes the "lbs" from the text column. This is more of the idea behind the read_csv converters parameter.

What the .csv looks like (If the image doesn't show up, please go to the article.)
the csv file with 6 columns. Weight is column with entries like 145 lbs.

#creating functions to clean the columns w = lambda x: (x.replace('lbs.','')) r = lambda x: (x.replace('"','')) #using converters to apply the functions to the columns fighter = pd.read_csv('raw_fighter_details.csv' ,                        converters={'Weight':w , 'Reach':r },                        header=0,                        usecols = [0,1,2,3]) fighter.head(15) 

The DataFrame after using converters on the Weight column.
enter image description here

like image 45
VISQL Avatar answered Oct 04 '22 08:10

VISQL