According to the pandas documentation, pandas.read_csv allows me to specify a dtype for the columns in the CSV file.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python'). Use str or object to preserve and not interpret dtype.
To treat every column as text data, I can use either
df = pandas.read_csv(..., dtype=str)
or
df = pandas.read_csv(..., dtype=object)
As far as I know, these two options always behave exactly the same. Are there any situations in which they behave differently? If so, what are the differences?
These had a subtle difference until release 0.11.1 (see issue #3795).
Every element in a numpy array must occupy the same number of bytes. The issue with strings is that their byte size is not fixed, so the object dtype instead stores pointers to the strings, and the pointers do have a fixed byte size. In short, numpy's str dtype reserves a fixed width for every item in the array, whereas object allows variable string lengths, or really any Python object.
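A small illustration of the difference in plain numpy (not specific to pandas; the fixed width "U5" here is an arbitrary choice for the example):

```python
import numpy as np

strings = ["short", "a much longer string"]

# Fixed-width string dtype: every element gets the same number of bytes,
# so longer strings are silently truncated to fit the declared width.
fixed = np.array(strings, dtype="U5")
print(fixed)   # ['short' 'a muc']

# object dtype: the array stores pointers (which have a fixed size),
# so each string keeps its full, variable length.
boxed = np.array(strings, dtype=object)
print(boxed)   # ['short' 'a much longer string']
```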
In any case, since release 0.11.1 dtype=str is automatically converted to dtype=object whenever it is seen, so it does not matter which one you use, although I would advise avoiding str altogether and just using dtype=object.
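A quick sanity check of that equivalence in a recent pandas version (a minimal sketch; the tiny inline CSV and its column names are made up for the example):

```python
import io
import pandas as pd

data = "a,b\n1,2.5\n3,4.5\n"

df_str = pd.read_csv(io.StringIO(data), dtype=str)
df_obj = pd.read_csv(io.StringIO(data), dtype=object)

# Both calls produce object-dtype columns holding the raw text,
# with no numeric interpretation of the values.
print(df_str.dtypes.tolist())   # [dtype('O'), dtype('O')]
print(df_obj.dtypes.tolist())   # [dtype('O'), dtype('O')]
print(df_str.equals(df_obj))    # True
```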