
Why is an empty DataFrame of dtype=str filled with "n"?

I cannot understand why a Series created using dtype=str results in this:

In [2]: pandas.Series(index=range(2), dtype=str)
Out[2]: 
0    NaN
1    NaN
dtype: object

but a DataFrame created using dtype=str results in this:

In [3]: pandas.DataFrame(index=range(2), columns=[0], dtype=str)
Out[3]: 
   0
0  n
1  n

Why strings with just the letter "n"?

Why this difference between Series and DataFrame?

And where is this documented?!

Pietro Battiston asked Mar 17 '23 03:03


1 Answer

This is now fixed in master and shouldn't be an issue from pandas 0.17.0 onwards.


In short, both DataFrames and Series create an empty NumPy array and fill it with np.nan values, but DataFrame uses the passed str dtype for this array while Series overrides it with the 'O' (object) dtype.

When no values are passed in, the __init__ method of both classes assigns an empty dictionary as the default data: data = {}.

After checking what type of object data is, the Series constructor falls back to generating an array of np.nan values, but with NumPy's 'O' datatype rather than the str datatype:

np.empty(n, dtype='O') # later filled with np.nan

The 'O' datatype is capable of holding any type of object, so np.nan causes no issues here.
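To see why the object dtype is harmless here, the following minimal sketch reproduces the Series-style fill with plain NumPy. The variable name obj is only illustrative and does not come from the pandas source.

```python
import numpy as np

# An object array stores references to arbitrary Python objects,
# so the float np.nan is stored unchanged in every slot.
obj = np.empty(2, dtype='O')
obj.fill(np.nan)

print(obj)           # [nan nan]
print(type(obj[0]))  # <class 'float'>
```

Each element is still a real float NaN, which is why the Series output shows NaN with dtype object.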

DataFrame's __init__ method also ends up using np.empty and then filling the empty array with np.nan. The difference is that the specified str datatype is used (and not the 'O' datatype). The code essentially reads as follows:

v = np.empty(len(index), dtype=str)
v.fill(np.nan)

Now, when np.empty is called with the str datatype, the resulting array has the NumPy dtype '<U1' (i.e. one Unicode character per element), and so after the fill v becomes:

array(['n', 'n'], dtype='<U1')

since n is the first letter of nan (np.nan is represented as just nan) and each element only has room for one character.
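The truncation can be checked directly in NumPy, without pandas. This is a small sketch assuming Python 3 (where str maps to a Unicode dtype); the array named wide is an illustrative addition showing that a three-character dtype keeps the full text.

```python
import numpy as np

# dtype=str with no length becomes a one-character Unicode dtype.
v = np.empty(2, dtype=str)
print(v.dtype.kind, v.dtype.itemsize)  # U 4  (one 4-byte character)

# np.nan is converted to the text 'nan' and cut to fit one character.
v.fill(np.nan)
print(v)  # ['n' 'n']

# With room for three characters, the whole word survives.
wide = np.empty(2, dtype='U3')
wide.fill(np.nan)
print(wide)  # ['nan' 'nan']
```

Either way the array holds text rather than a missing value, so for string columns that need real NaN entries the object dtype is the safer container.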

Alex Riley answered Mar 20 '23 04:03