
What is the difference between `str` and `object` data types in `pandas.read_csv`?

According to the pandas documentation, pandas.read_csv allows me to specify a dtype for the columns in the CSV file.

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python'). Use str or object to preserve and not interpret dtype.

To treat every column as text data, I can use either

df = pandas.read_csv(... , dtype=str)

or

df = pandas.read_csv(..., dtype=object)

As far as I know, these two methods always behave exactly the same. Are there any situations in which these two methods behave differently? If so, what are the differences?

asked Feb 07 '23 00:02 by DGrady


1 Answer

The two had a subtle difference until pandas release 0.11.1 (see issue #3795).

Every element in a NumPy array must have the same size in bytes. Strings do not have a fixed size in bytes, so the object dtype stores pointers to the strings instead, and those pointers do have a fixed size. In short, str is a fixed-width dtype with the same width for every item, whereas object allows strings of any length, or really any Python object.
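The fixed-width behaviour can be seen in plain NumPy. This is a minimal sketch, not taken from the original answer; the array contents and variable names are made up for illustration.

```python
import numpy as np

# dtype=str gives a fixed-width unicode dtype sized to the longest item.
fixed = np.array(["a", "abc"], dtype=str)
print(fixed.dtype)

# Assigning a longer string silently truncates it to the fixed width.
fixed[0] = "abcdef"
print(fixed[0])

# dtype=object stores references to Python objects, so any length fits.
flexible = np.array(["a", "abc"], dtype=object)
flexible[0] = "abcdef"
print(flexible[0])
```

The silent truncation in the first array is the kind of surprise that storing strings as object avoids.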

In any case, since release 0.11.1 pandas automatically converts dtype=str to dtype=object whenever it sees it, so it does not matter which one you use. I would still advise avoiding str altogether and just using dtype=object.
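One way to check this on your own installation is to read the same data both ways and compare. This is a rough sketch with made-up CSV text and hypothetical variable names. The reported dtypes may depend on the pandas version (newer releases with a dedicated string dtype may not report object for dtype=str), so the sketch compares the values rather than assuming a particular dtype.

```python
import io

import pandas as pd

# Made-up sample data; the leading zeros would be lost if parsed as numbers.
csv_text = "id,value\n007,1.50\n042,2.00\n"

df_str = pd.read_csv(io.StringIO(csv_text), dtype=str)
df_obj = pd.read_csv(io.StringIO(csv_text), dtype=object)

# Inspect what each call produced on this pandas version.
print(df_str.dtypes)
print(df_obj.dtypes)

# Either way, the cell values are kept as uninterpreted text.
same_values = df_str.values.tolist() == df_obj.values.tolist()
print(same_values)
```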

answered Feb 12 '23 10:02 by miradulo