When you import data into a Pandas DataFrame, Pandas by default tries to infer the data type of each column. But the object dtype has a much broader scope: it can hold not only strings, but also any other data that Pandas doesn't understand.
The dtypes property reports the data types in the DataFrame. It returns a Series with the data type of each column; the result's index is the original DataFrame's columns, and columns with mixed types are stored with the object dtype. By default, then, Pandas stores strings using the object datatype, which covers both strings and mixed data and isn't particularly explicit. Beginning in version 1.0, Pandas has had a dedicated string datatype, covered below.
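For example, a minimal sketch (the column names and values here are made up for illustration):

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['Alice', 'Bob'],   # strings -> object
...     'age': [30, 25],            # integers -> int64
...     'mixed': ['x', 1],          # mixed types -> object
... })
>>> df.dtypes
name     object
age       int64
mixed    object
dtype: object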
Pandas uses the object dtype for storing strings.
The dtype object comes from NumPy; it describes the type of element in an ndarray. Every element in an ndarray must have the same size in bytes: for int64 and float64, that's 8 bytes. But for strings, the length is not fixed. So instead of saving the bytes of strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to Python objects; because of this, the dtype of this kind of ndarray is object.
Here is an example (a minimal sketch; the 8-byte itemsize assumes a 64-bit build):
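>>> import numpy as np
>>> ints = np.array([3, 0, 1], dtype=np.int64)
>>> ints.dtype, ints.itemsize   # every element is exactly 8 bytes
(dtype('int64'), 8)
>>> strs = np.array(['hello', 'i', 'am', 'a', 'banana'], dtype=object)
>>> strs.dtype, strs.itemsize   # each slot holds an 8-byte pointer, not the string itself
(dtype('O'), 8)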
@HYRY's answer is great. I just want to provide a little more context.
Arrays store data as contiguous, fixed-size memory blocks. Together, these two properties are what make arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3, 0, 1].
If you ask your computer to fetch the 3rd element, it starts at the beginning of the array and jumps across 64 bits (two 32-bit elements) to reach it. Knowing exactly how many bits to jump across is what makes arrays fast.
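You can see this jump size directly in NumPy's strides (a minimal sketch; note that strides reports bytes, not bits):

>>> import numpy as np
>>> a = np.array([3, 0, 1], dtype=np.int32)
>>> a.strides   # 4 bytes (32 bits) between consecutive elements
(4,)

Element i lives at base_address + i * 4 bytes, so index 2 sits 64 bits from the start.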
Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, every block would end up a different width.
Now your computer doesn't have a fast way to access a randomly requested element. The key to overcoming this is to use pointers: store each string at some arbitrary memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) The array is back to fixed-size elements.
Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then take one extra step to fetch the string.
The challenge for NumPy is that there's no guarantee the pointers are actually pointing to strings. That's why it reports the dtype as 'object'.
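A quick demonstration (a sketch; the elements are arbitrary Python objects chosen for illustration):

>>> import numpy as np
>>> mixed = np.array(['hello', 42, {'kind': 'fruit'}], dtype=object)
>>> mixed.dtype   # NumPy only knows it holds pointers to Python objects
dtype('O')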
Shamelessly gonna plug my own course on NumPy where I originally discussed this.
The accepted answer is good. I just wanted to add a reference to the documentation, which says:
Pandas uses the object dtype for storing strings.
The accepted answer did a great job explaining the "why"; strings are variable-length:
But for strings, the length of the string is not fixed.
But as the leading comment on the accepted answer once said: "Don't worry about it; it's supposed to be like this."
As of version 1.0.0 (January 2020), pandas has introduced an experimental feature providing first-class support for string types: pandas.StringDtype.

While you'll still see object by default, the new type can be used by specifying a dtype of pd.StringDtype or simply 'string':
>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string
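Relatedly, if you want to opt in across a whole DataFrame, DataFrame.convert_dtypes (also added in 1.0) infers the best nullable dtypes, converting string columns from object to string (a sketch; the column contents are made up):

>>> df = pd.DataFrame({'a': ['x', None, 'z'], 'b': [1, 2, 3]})
>>> df.convert_dtypes().dtypes
a    string
b     Int64
dtype: object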