I cannot understand why a Series created using dtype=str results in this:
In [2]: pandas.Series(index=range(2), dtype=str)
Out[2]:
0 NaN
1 NaN
dtype: object
but a DataFrame created using dtype=str results in this:
In [3]: pandas.DataFrame(index=range(2), columns=[0], dtype=str)
Out[3]:
0
0 n
1 n
Why strings with just the letter "n"?
Why this difference between Series and DataFrame?
And where is this documented?!
This is now fixed in master and shouldn't be an issue from pandas 0.17.0 onwards.
In short, both DataFrame and Series create an empty NumPy array and fill it with np.nan values, but DataFrame uses the passed str dtype for this array while Series overrides it with the 'O' (object) dtype.
When no values are passed in, the __init__ method of both classes assigns an empty dictionary as the default data: data = {}.
After testing what type of object data is, the Series construction method falls back to generating an array of np.nan values, but using NumPy's 'O' datatype (not the str datatype):
np.empty(n, dtype='O') # later filled with np.nan
The 'O' datatype is capable of holding any type of object, so np.nan causes no issues here.
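As a rough illustration of the Series path described above (a minimal sketch in plain NumPy, not the actual pandas source; the names n and arr are just placeholders), an object array stores np.nan as a real float rather than as text:

```python
import numpy as np

n = 2  # placeholder length, standing in for len(index)

# Each slot of an object array holds a reference to an arbitrary Python object.
arr = np.empty(n, dtype='O')
arr.fill(np.nan)

print(arr.dtype)  # the object dtype
print(arr)        # both elements are still float nan values, not strings
```

Because nothing is converted to a string here, the missing values survive intact and display as NaN in the Series.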
DataFrame's __init__ method also ends up using np.empty and then filling the empty array with np.nan. The difference is that the specified str datatype is used (and not the 'O' datatype). The code essentially reads as follows:
v = np.empty(len(index), dtype=str)
v.fill(np.nan)
Now, when np.empty is called with the str datatype, the resulting array has the NumPy dtype '<U1' (i.e. one Unicode character), and so v becomes:
array(['n', 'n'], dtype='<U1')
since n is the first letter of nan (np.nan is represented as just nan) and each element only has room for a single character.
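To see that the stray "n" really is a truncated "nan", it may help to compare the one-character array with a wider one. This is only a sketch in plain NumPy; the '<U3' dtype and the name wide are my own additions for comparison and are not something pandas does:

```python
import numpy as np

# dtype=str with no length yields a one-character Unicode dtype.
v = np.empty(2, dtype=str)
v.fill(np.nan)   # nan is converted to the text 'nan', then cut to fit one character

# Hypothetical comparison: a dtype with room for three characters.
wide = np.empty(2, dtype='<U3')
wide.fill(np.nan)

print(v.dtype, v)
print(wide.dtype, wide)
```

With room for three characters the full text nan is kept, which suggests the "n" in the DataFrame output is simply what is left after NumPy truncates the string form of np.nan.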