Numpy seems to make a distinction between str and object types. For instance, I can do ::
    >>> import pandas as pd
    >>> import numpy as np
    >>> np.dtype(str)
    dtype('S')
    >>> np.dtype(object)
    dtype('O')
where dtype('S') and dtype('O') correspond to str and object, respectively.
However, pandas seems to lack that distinction and coerces str to object ::
    >>> df = pd.DataFrame({'a': np.arange(5)})
    >>> df.a.dtype
    dtype('int64')
    >>> df.a.astype(str).dtype
    dtype('O')
    >>> df.a.astype(object).dtype
    dtype('O')
Forcing the type to dtype('S') does not help either ::
    >>> df.a.astype(np.dtype(str)).dtype
    dtype('O')
    >>> df.a.astype(np.dtype('S')).dtype
    dtype('O')
Is there any explanation for this behavior?
An object column can hold not only strings but also any other data that pandas has no more specific dtype for. This matters because an object dtype does not guarantee that every value in the column is a string: the values may all be numbers, or a mixture of strings, integers, and floats. String operations on such a column are therefore not guaranteed to work for every value.
In short, numpy's str dtype has a fixed width for each item, whereas object allows strings of any length, or really any Python object.
When an object column does hold strings, operators such as + perform a string operation (concatenation) rather than a mathematical one. To see the dtype of every column in a dataframe, use df.dtypes.
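To make the point about mixed contents concrete, here is a minimal sketch. The column names and values are made up for illustration, and dtype=object is passed explicitly so the result does not depend on what a given pandas version infers for text.

```python
import pandas as pd

# Hypothetical example data: one explicitly object-typed column, one integer column.
df = pd.DataFrame({
    "mixed": pd.Series(["text", 1, 2.5], dtype=object),
    "ints": [1, 2, 3],
})

# df.dtypes reports one dtype per column.
print(df.dtypes)

# The object column keeps every value as its original Python type.
value_types = [type(v).__name__ for v in df["mixed"]]
print(value_types)
```

The "mixed" column reports a single object dtype even though it stores a str, an int, and a float side by side.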
Numpy's string dtypes aren't python strings.
Therefore, pandas deliberately uses native python strings, which require an object dtype.
First off, let me demonstrate a bit of what I mean by numpy's strings being different:
    In [1]: import numpy as np
    In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
    In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)
Now, x has a numpy string dtype (fixed-width, C-like strings) and y is an array of native python strings.
If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:
    In [4]: x[1] = 'a really really really long'
    In [5]: x
    Out[5]:
    array(['Testing', 'a reall', 'string'],
          dtype='|S7')
While the object dtype versions can be arbitrary length:
    In [6]: y[1] = 'a really really really long'
    In [7]: y
    Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
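The session above appears to come from Python 2. As a rough sketch of the same comparison under Python 3, where the S dtype stores bytes rather than str, something like the following should behave the same way. The variable long_text is just a helper name introduced here.

```python
import numpy as np

# Fixed-width byte strings: every item occupies exactly 7 bytes.
x = np.array([b"Testing", b"a", b"string"], dtype="S7")
# Object array: each item is a reference to an ordinary Python str.
y = np.array(["Testing", "a", "string"], dtype=object)

long_text = "a really really really long"
x[1] = long_text.encode("ascii")  # silently cut down to the first 7 bytes
y[1] = long_text                  # stored in full

print(x.dtype.itemsize)  # bytes reserved per item in the fixed-width array
print(x[1])
print(y[1])
```

No error or warning is raised for the truncated assignment, which is part of why fixed-width strings can surprise users.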
Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype, as well. I'll skip an example, for the moment.
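For readers who do want to see it, here is a small sketch, assuming Python 3 and a sample word chosen for illustration. It tries to store a non-ASCII string in the byte-string dtype and then in the fixed-width unicode dtype.

```python
import numpy as np

text = "café"  # sample value containing one non-ASCII character

# Under Python 3, numpy has to encode str values as ASCII to store them
# in the S dtype, so a non-ASCII character makes the conversion fail.
try:
    np.array([text], dtype="S4")
    bytes_ok = True
except UnicodeEncodeError:
    bytes_ok = False

# The fixed-width unicode dtype (U) stores the same text without trouble,
# but it is still limited to a fixed number of characters per item.
u = np.array([text], dtype="U4")

print(bytes_ok)
print(u, u.dtype)
```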
Finally, numpy's strings are actually mutable, while Python strings are not. For example:
    In [8]: z = x.view(np.uint8)
    In [9]: z += 1
    In [10]: x
    Out[10]:
    array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
          dtype='|S7')
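A self-contained version of the same trick, sketched for Python 3 with byte literals, alongside the contrasting behavior of a native Python string. The starting values mirror the truncated array from the earlier session.

```python
import numpy as np

x = np.array([b"Testing", b"a reall", b"string"], dtype="S7")

# Reinterpret the same memory as raw bytes and bump every byte by one.
z = x.view(np.uint8)
z += 1
print(x)  # the strings changed in place because z shares memory with x

# A native Python str offers no way to modify it in place.
s = "Testing"
try:
    s[0] = "U"
    str_mutable = True
except TypeError:
    str_mutable = False
print(str_mutable)
```

The last item gains a trailing \x01 because its unused padding byte was incremented along with the rest.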
For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a python string into a fixed-width numpy string won't work in pandas. Instead, it always uses native python strings, which behave in a more intuitive way for most users.
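As a brief illustration of what that buys you, the sketch below stores the same three strings in a pandas Series. Passing dtype=object explicitly is a choice made here so the example does not rely on what a particular pandas version infers for text.

```python
import pandas as pd

# Values of very different lengths live side by side with no truncation.
s = pd.Series(["Testing", "a really really really long", "string"], dtype=object)

# Every element is an ordinary Python str.
all_native = all(isinstance(v, str) for v in s)

# The .str accessor applies string methods element by element.
lengths = s.str.len().tolist()

print(all_native)
print(lengths)
```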