I would like to solve this problem in two possible cases:
Where I don't know beforehand whether the Series of strings is going to contain UTF-8 strings or bytes.
Where the strings in a pd.Series are a mix of bytes and UTF-8.
I'd guess both cases have the same solution.
Currently for:
b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321'])
b.str.decode('utf-8')
This gives NaN where the strings were already UTF-8. Or are they automatically ASCII? Can I pass the errors parameter to decode so that a string stays undecoded where it's already UTF-8, for example? The docstring doesn't seem to provide much info.
Or is there a better way to accomplish this?
Alternatively, is there a string method in pandas, like .str.decode, that instead just returns True/False depending on whether each element is bytes or a UTF-8 string?
EDIT:
One option I can think of is:
b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321'])
converted = b.str.decode('utf-8')
b.loc[~converted.isnull()] = converted
Is this the recommended way then? It seems a bit roundabout. I guess what would be more elegant is a way to check whether each element of a Series is bytes and return a boolean array that is True where that's the case.
This will definitely slow things down for a large Series, but you can use apply with a callable that wraps a conditional (ternary) expression:
>>> b.apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
0 123
1 434,
2 fgd
3 aas
4 442321
dtype: object
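The same isinstance check can also produce the boolean array the question asks for. The following is only a sketch, assuming an object-dtype Series like the one in the question; the names is_bytes and decoded are made up for illustration.

```python
import pandas as pd

b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321'])

# Boolean mask: True where the element is a bytes object.
is_bytes = b.map(lambda x: isinstance(x, bytes))

# Decode only the masked elements; every other element is left untouched.
decoded = b.copy()
decoded[is_bytes] = b[is_bytes].map(lambda x: x.decode('utf-8'))

print(is_bytes.tolist())
print(decoded.tolist())
```

This still calls a Python function per element, so it is not expected to be faster than the apply version; it just makes the mask available separately.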
Looking at the source for .str.decode() is instructive: it just applies _na_map(f, arr) over the Series, where the function f is f = lambda x: x.decode(encoding, errors). Because str doesn't have a decode method to begin with, calling f on a str raises an error, and _na_map turns that error into NaN. This happens in str_decode().
>>> from pandas.core.strings import str_decode
>>> from pandas.core.strings import _cpython_optimized_encoders
>>> "utf-8" in _cpython_optimized_encoders
True
>>> str_decode(b, "utf-8")
array([nan, nan, nan, nan, '442321'], dtype=object)
>>> from pandas.core.strings import _na_map
>>> f = lambda x: x.decode("utf-8")
>>> _na_map(f, b)
array([nan, nan, nan, nan, '442321'], dtype=object)
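To make that behaviour concrete without relying on pandas internals, here is a rough pure-Python sketch of the idea. na_map_sketch is a hypothetical name, not the pandas function, and it assumes that errors of type TypeError and AttributeError are the ones swallowed and replaced by the NA value.

```python
import math

def na_map_sketch(f, values, na_value=float("nan")):
    """Apply f to each value; use na_value where f cannot be applied."""
    out = []
    for x in values:
        try:
            out.append(f(x))
        except (TypeError, AttributeError):
            # str has no .decode() in Python 3, so the AttributeError
            # raised for a str element ends up here and becomes NaN.
            out.append(na_value)
    return out

values = ['123', '434,', 'fgd', 'aas', b'442321']
result = na_map_sketch(lambda x: x.decode('utf-8'), values)
print(result)
```

Only the bytes element is decoded; the four str elements come back as NaN, which matches the pattern in the output above.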
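This problem is still open in git.
It is caused by the line
except (TypeError, AttributeError): return na_value
You can work around it by adding fillna, which puts the original values back wherever decode returned NaN: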
b.str.decode('utf-8').fillna(b)
Out[237]:
0 123
1 434,
2 fgd
3 aas
4 442321
dtype: object