pandas distinction between str and object types

Tags:

python

pandas

numpy

NumPy seems to make a distinction between str and object types. For instance, I can do ::

    >>> import pandas as pd
    >>> import numpy as np
    >>> np.dtype(str)
    dtype('S')
    >>> np.dtype(object)
    dtype('O')

Where dtype('S') and dtype('O') correspond to str and object respectively.

However, pandas seems to lack that distinction and coerces str to object. ::

    >>> df = pd.DataFrame({'a': np.arange(5)})
    >>> df.a.dtype
    dtype('int64')
    >>> df.a.astype(str).dtype
    dtype('O')
    >>> df.a.astype(object).dtype
    dtype('O')

Forcing the type to dtype('S') does not help either. ::

    >>> df.a.astype(np.dtype(str)).dtype
    dtype('O')
    >>> df.a.astype(np.dtype('S')).dtype
    dtype('O')

Is there any explanation for this behavior?

457

asked Jan 19 '16 15:01

Meitham

1 Answer

NumPy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

    In [1]: import numpy as np

    In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')

    In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is an array with a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

    In [4]: x[1] = 'a really really really long'

    In [5]: x
    Out[5]:
    array(['Testing', 'a reall', 'string'],
          dtype='|S7')

While the object dtype versions can be arbitrary length:

    In [6]: y[1] = 'a really really really long'

    In [7]: y
    Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
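
The session above prints the way it does under Python 2. As a rough sketch of the same comparison on Python 3 (assuming a reasonably recent NumPy, and using explicit byte strings for the fixed-width array), the truncation can be checked like this. The names `fixed`, `flexible` and `long_value` are only illustrative:

```python
import numpy as np

# Fixed-width byte strings: every element occupies exactly 7 bytes.
fixed = np.array([b"Testing", b"a", b"string"], dtype="S7")

# Object array: each element is a reference to an ordinary Python str.
flexible = np.array(["Testing", "a", "string"], dtype=object)

long_value = "a really really really long"

# Assigning a longer value is silently cut down to the item size.
fixed[1] = long_value.encode("ascii")

# The object array simply points at the new, full-length string.
flexible[1] = long_value

print(fixed.dtype.itemsize)  # bytes reserved per element
print(fixed[1])
print(flexible[1])
```

The point of the sketch is that no error or warning is raised on the truncating assignment, which is one reason fixed-width strings can surprise users.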

Next, the |S dtype strings hold raw bytes, so they can't store unicode text properly, though there is a fixed-length unicode string dtype (U) as well. I'll skip an example, for the moment.
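
For readers who want to see the unicode point anyway, here is a minimal sketch, assuming Python 3 and a recent NumPy. The sample text and variable names are made up for illustration:

```python
import numpy as np

text = "naïve café"  # 10 characters, two of them non-ASCII

# The byte-string dtype encodes Python str values as ASCII,
# so non-ASCII characters are rejected.
try:
    np.array([text], dtype="S10")
    stored_as_bytes = True
except UnicodeEncodeError:
    stored_as_bytes = False

# The fixed-width unicode dtype accepts the text,
# at a cost of 4 bytes per character.
u = np.array([text], dtype="U10")

# It is still fixed-width, so longer values are truncated.
clipped = np.array(["abcdefghijkl"], dtype="U10")

print(stored_as_bytes)
print(u.dtype.itemsize)
print(clipped[0])
```

So the unicode dtype solves the encoding problem but keeps the fixed-width behaviour described above.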

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

    In [8]: z = x.view(np.uint8)

    In [9]: z += 1

    In [10]: x
    Out[10]:
    array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
          dtype='|S7')
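
A self-contained variant of the same experiment, written as a sketch for Python 3 (byte literals instead of Python 2 strings), with a native Python string alongside for contrast. It starts from the array contents as they stood after the truncation step:

```python
import numpy as np

x = np.array([b"Testing", b"a reall", b"string"], dtype="S7")

# View the same memory as raw bytes; no data is copied.
raw = x.view(np.uint8)

# Bump every byte in place, which changes the strings stored in x.
raw += 1

print(x[0])
print(x[1])
print(x[2])  # the padding byte after 'string' became \x01

# A native Python string refuses in-place modification.
s = "Testing"
try:
    s[0] = "U"
    python_str_mutable = True
except TypeError:
    python_str_mutable = False

print(python_str_mutable)
```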

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
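
To tie this back to the question, the following sketch checks what a converted column actually holds. The helper `is_numpy_fixed_width` is a hypothetical name introduced here, not a pandas or NumPy function. The exact dtype printed may depend on the pandas version (the session in the question shows object, and newer releases may report a dedicated string dtype instead), so the sketch only checks that the result is not a NumPy fixed-width dtype:

```python
import numpy as np
import pandas as pd


def is_numpy_fixed_width(dtype):
    """Hypothetical helper: True only for NumPy 'S' or 'U' dtypes."""
    return isinstance(dtype, np.dtype) and dtype.kind in ("S", "U")


df = pd.DataFrame({"a": np.arange(5)})

# Convert the integer column to strings, as in the question.
as_text = df["a"].astype(str)

print(as_text.dtype)
print(is_numpy_fixed_width(as_text.dtype))
print(type(as_text.iloc[0]))  # elements are ordinary Python str objects

# Because the elements are native strings, longer values are kept whole.
as_text.iloc[1] = "a really really really long"
print(as_text.iloc[1])
```

Compare with `is_numpy_fixed_width(np.dtype("S7"))`, which is true for the array `x` used earlier.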

109

answered Sep 19 '22 04:09

Joe Kington
