pandas function read_csv() reads a .csv file. Its documentation is here
According to documentation, we know:
dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’)
and
converters : dict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels
When using this function, I can call either pandas.read_csv('file',dtype=object)
or pandas.read_csv('file',converters=object)
. Obviously, converter, its name can says that data type will be converted but I wonder the case of dtype?
There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats. Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.
To check the data type in pandas DataFrame we can use the “dtype” attribute. The attribute returns a series with the data type of each column. And the column names of the DataFrame are represented as the index of the resultant series object and the corresponding data types are returned as values of the series object.
This means, if you say when a column is an Object dtype, and it doesn't mean all the values in that column will be a string or text data. In fact, they may be numbers, or a mixture of string, integers, and floats dtype. So with this incompatibility, we can not do any string operations on that column directly.
Read a CSV File In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data. csv , which you specified with the first argument.
The semantic difference is that dtype
allows you to specify how to treat the values, for example, either as numeric or string type.
Converters allows you to parse your input data to convert it to a desired dtype using a conversion function, e.g, parsing a string value to datetime or to some other desired dtype.
Here we see that pandas tries to sniff the types:
In [2]: df = pd.read_csv(io.StringIO(t)) t="""int,float,date,str 001,3.31,2015/01/01,005""" df = pd.read_csv(io.StringIO(t)) df.info() <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int 1 non-null int64 float 1 non-null float64 date 1 non-null object str 1 non-null int64 dtypes: float64(1), int64(2), object(1) memory usage: 40.0+ bytes
You can see from the above that 001
and 005
are treated as int64
but the date string stays as str
.
If we say everything is object
then essentially everything is str
:
In [3]: df = pd.read_csv(io.StringIO(t), dtype=object).info() <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int 1 non-null object float 1 non-null object date 1 non-null object str 1 non-null object dtypes: object(4) memory usage: 40.0+ bytes
Here we force the int
column to str
and tell parse_dates
to use the date_parser to parse the date column:
In [6]: pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info() <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int 1 non-null object float 1 non-null float64 date 1 non-null datetime64[ns] str 1 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(1), object(1) memory usage: 40.0+ bytes
Similarly we could've pass the to_datetime
function to convert the dates:
In [5]: pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info() <class 'pandas.core.frame.DataFrame'> Int64Index: 1 entries, 0 to 0 Data columns (total 4 columns): int 1 non-null int64 float 1 non-null float64 date 1 non-null datetime64[ns] str 1 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(2) memory usage: 40.0 bytes
I would say that the main purpose for converters
is to manipulate the values of the column, not the datatype. The answer shared by @EdChum focuses on the idea of the dtypes
. It uses the pd.to_datetime
function.
Within this article https://medium.com/analytics-vidhya/make-the-most-out-of-your-pandas-read-csv-1531c71893b5 in the area about converters, you will see an example of changing a csv column, with values such as "185 lbs.", into something that removes the "lbs" from the text column. This is more of the idea behind the read_csv
converters
parameter.
What the .csv looks like (If the image doesn't show up, please go to the article.)
#creating functions to clean the columns w = lambda x: (x.replace('lbs.','')) r = lambda x: (x.replace('"','')) #using converters to apply the functions to the columns fighter = pd.read_csv('raw_fighter_details.csv' , converters={'Weight':w , 'Reach':r }, header=0, usecols = [0,1,2,3]) fighter.head(15)
The DataFrame
after using converters
on the Weight column.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With