
pandas distinction between str and object types

NumPy seems to make a distinction between str and object types. For instance, I can do::

    >>> import pandas as pd
    >>> import numpy as np
    >>> np.dtype(str)
    dtype('S')
    >>> np.dtype(object)
    dtype('O')

Here dtype('S') and dtype('O') correspond to str and object, respectively.

However, pandas seems to lack that distinction and coerces str to object::

    >>> df = pd.DataFrame({'a': np.arange(5)})
    >>> df.a.dtype
    dtype('int64')
    >>> df.a.astype(str).dtype
    dtype('O')
    >>> df.a.astype(object).dtype
    dtype('O')

Forcing the type to dtype('S') does not help either. ::

    >>> df.a.astype(np.dtype(str)).dtype
    dtype('O')
    >>> df.a.astype(np.dtype('S')).dtype
    dtype('O')

Is there any explanation for this behavior?

Meitham asked Jan 19 '16 15:01



1 Answer

NumPy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.

First off, let me demonstrate what I mean by NumPy's strings being different:

    In [1]: import numpy as np
    In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
    In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x has a NumPy string dtype (fixed-width, C-like strings), and y is an array of native Python strings.
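To make the storage difference concrete, here is a minimal sketch. It assumes Python 3 and a reasonably recent NumPy; the arrays mirror the session above, and the helper variable names are only for illustration.

```python
import numpy as np

x = np.array(['Testing', 'a', 'string'], dtype='S7')
y = np.array(['Testing', 'a', 'string'], dtype=object)

# Every element of x occupies exactly 7 bytes in one contiguous buffer.
item_bytes = x.dtype.itemsize
total_bytes = x.nbytes  # 3 elements * 7 bytes each

# Elements of x come back as NumPy byte strings, not Python str.
x_elem_type = type(x[0])

# y only stores references to ordinary Python str objects.
y_elem_type = type(y[0])

print(item_bytes, total_bytes, x_elem_type, y_elem_type)
```

Short entries such as 'a' are padded with null bytes to fill the fixed width, which is why the whole array can live in a single flat buffer.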

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

    In [4]: x[1] = 'a really really really long'
    In [5]: x
    Out[5]:
    array(['Testing', 'a reall', 'string'],
          dtype='|S7')

While the object dtype versions can be arbitrary length:

    In [6]: y[1] = 'a really really really long'
    In [7]: y
    Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, the |S dtype strings can't hold unicode properly, though there is a fixed-length unicode string dtype as well. I'll skip an example for the moment.
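For readers who want that skipped example anyway, here is a rough sketch of the unicode point. It assumes Python 3, where NumPy encodes a str as ASCII when storing it in an |S array; the sample words and variable names are made up for illustration.

```python
import numpy as np

# Fixed-width unicode dtype: each element holds at most 5 code points.
u = np.array(['café', 'naïve'], dtype='U5')

# Like |S, the U dtype silently truncates anything that does not fit.
u[0] = 'résumé extended'
truncated = str(u[0])

# The bytes dtype rejects non-ASCII text supplied as a Python 3 str.
try:
    np.array(['café'], dtype='S4')
    bytes_ok = True
except UnicodeError:
    bytes_ok = False

print(truncated, bytes_ok)
```

So the U dtype fixes the encoding problem but keeps the fixed-width truncation behavior.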

Finally, NumPy's strings are actually mutable, while Python strings are not. For example:

    In [8]: z = x.view(np.uint8)
    In [9]: z += 1
    In [10]: x
    Out[10]:
    array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
          dtype='|S7')

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
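As a quick check of that behavior, here is a small sketch assuming Python 3 with pandas and NumPy installed. The dtype that pandas reports for string data depends on the pandas version (object historically, while newer releases may report a dedicated string dtype), so the sketch inspects the stored values rather than the dtype name.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5)).astype(str)

# The values are ordinary Python str objects, not fixed-width NumPy strings.
all_python_str = all(isinstance(v, str) for v in s)

# Assigning a longer string does not truncate it.
long_text = 'a really really really long'
s.iloc[1] = long_text
kept_whole = s.iloc[1] == long_text

# The column does not take a NumPy 'S' (bytes) dtype.
is_numpy_bytes = getattr(s.dtype, 'kind', None) == 'S'

print(all_python_str, kept_whole, is_numpy_bytes)
```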

Joe Kington answered Sep 19 '22 04:09