I have a large DataFrame with varied dtypes where I have to perform a global .replace to turn both NaN, NaT and empty strings into None. The DataFrame looks like
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    'a': [n*10.0 for n in range(5)],
    'b': [datetime.now() if n%3 else None for n in range(5)],
    'c': pd.Series([f'D{n}' if n%2 else '' for n in range(5)], dtype='category'),
    'd': ['Long text chunk...' if n%3 else None for n in range(5)]
})
Which prints
      a                          b   c                   d
0   0.0                        NaT                    None
1  10.0 2020-08-13 23:35:55.533189  D1  Long text chunk...
2  20.0 2020-08-13 23:35:55.533189      Long text chunk...
3  30.0                        NaT  D3                None
4  40.0 2020-08-13 23:35:55.533189      Long text chunk...
My purpose is to bulk upload the rows into ElasticSearch, which won't accept NaN - neither NaT nor empty strings for date fields - without some setting changes I'm trying to avoid. I figured this way would be faster than individually checking every row when making the dicts.
Converting all columns to object before replacing wasn't even runnable due to the DataFrame size - I'd prefer not to convert any column at all. An approach that once worked was
df.fillna('').replace('', None)
But now, adding some category dtypes in, it raises TypeError: No matching signature found.
Searching this, nothing I found was related to pandas at all. It's clearly linked to the category dtype¹, but what I don't know:
What's the most pythonic way of doing this while keeping integrity for all columns, especially the categorical ones?
What happens behind the curtains for pandas to raise this apparently generic error in a .replace?
¹ Edit:
I later found that the pandas implementation replace in this case reaches up to a Cython-compiled method - pandas._libs.algos.pad_inplace - which expects to fill any Series dtype except category. That's why my error mentions a signature mismatch. I still wonder if this is intended behavior, as I'd expect an ffill to work especially well in categorical columns.
Since my numeric columns were filled already, I changed column a here to reflect that. So my hassle is solely the category dtype.
For one-off replace operations, it's good to avoid global conversions to object because that's costly processing-wise and memory-wise. But, as @hpaul mentioned in a comment, None is an object and not a primitive value, thus a Series must be of object type to contain it. e.g. a datetime Series will always turn None into NaT, because that's the primitive representation of the absence of a primitive date value. As is NaN for numeric dtypes and category.
Given that, I found this method to be best:
df.replace((np.nan, ''), (None, None))
As a result, we get:
      a                           b     c                   d
0   0.0                        None  None                None
1  10.0  2020-08-14 01:09:41.936421    D1  Long text chunk...
2  20.0  2020-08-14 01:09:41.936421  None  Long text chunk...
3  30.0                        None    D3                None
4  40.0  2020-08-14 01:09:41.936421  None  Long text chunk...
Due to also not relying on .astype or .fillna beforehand, this is both safer (better conversions¹) and more performant than other methods:
In [2]: %timeit -n 1000 df.replace((np.nan, ''), (None, None))
1.32 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: %timeit -n 1000 df.replace({np.nan: None, '': None})
                        # ^ pandas translates this into the first call,
                        # taking a few more milliseconds
1.36 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit -n 1000 df.astype(object).where(df.notnull(), None).where(df != '', None)
2.83 ms ± 78.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
¹ pandas converts the dtypes it needs to (anything other than numerics and object itself) into object, but this method is faster because conversion is lazily done, and has the advantage of being implicitly handled by pandas. A demonstration:
In [5]: df.dtypes
a           float64
b    datetime64[ns]
c          category
d            object
dtype: object
Meanwhile, after the replace
In [6]: df.replace((np.nan, ''), (None, None)).dtypes
a    float64
b     object
c     object
d     object
dtype: object
The float64 column didn't have any empty values to replace, so it didn't change at all.
Do note this is not the same as .replace(np.nan, None).replace('', None), which would result in the same TypeError, because...
The reason this TypeError happens goes way back into pandas' Cython implementation of the default replace method, which is called pad or forward fill. But it also has to do with API choices:
pandas._libs.algos.pad_inplace) expects to fill any Series dtype except category, that's why the error mentions a signature mismatch.None as a positional argument can be misleading - pandas treats this as if "you're not passing anything as the replace value" instead of "you're passing nothing as the replace value".Notice what happens when converting the DataFrame to object and then using the same method that once worked:
In [7]: df.astype(object).fillna('').replace('', None)
      a                           b   c                   d
0
1  10.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
2  20.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
3  30.0  2020-08-13 21:18:42.520455  D3  Long text chunk...
4  40.0  2020-08-13 21:18:42.520455  D3  Long text chunk...
Values have been forward filled, as can be seen more easily in column c. This is because, in practice, .replace('', None) is the same as .replace(''), and pandas' API has taken the route of assuming the above is the kind of behavior sought by this operation - a plain forward fill. Except, as explained, that wouldn't work for category dtypes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With