Consider
np.random.seed(0)
s1 = pd.Series([1, 2, 'a', 'b', [1, 2, 3]])
s2 = np.random.randn(len(s1))
s3 = np.random.choice(list('abcd'), len(s1))
df = pd.DataFrame({'A': s1, 'B': s2, 'C': s3})
df
A B C
0 1 1.764052 a
1 2 0.400157 d
2 a 0.978738 c
3 b 2.240893 a
4 [1, 2, 3] 1.867558 a
Column "A" has mixed data types. I would like to come up with a really quick way of determining this. It would not be as simple as checking whether type == object
, because that would identify "C" as a false positive.
I can think of doing this with
df.applymap(type).nunique() > 1
A True
B False
C False
dtype: bool
But calling type
atop applymap
is pretty slow. Especially for larger frames.
%timeit df.applymap(type).nunique() > 1
3.95 ms ± 88 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Can we do better (perhaps with NumPy)? I can accept "No" if your argument is convincing enough. :-)
In pandas there's infer_dtype()
which might be helpful here.
Written in Cython (code link), it returns a string summarising the values in the passed object. It's used a lot in pandas' internals so we might reasonably expect that's it has been designed with efficiency in mind.
>>> from pandas.api.types import infer_dtype
Now, column A is a mix of integers and some other types:
>>> infer_dtype(df.A)
'mixed-integer'
Column B's values are all of floating type:
>>> infer_dtype(df.B)
'floating'
Column C contains strings:
>>> infer_dtype(df.B)
'string'
The general "catchall" type for mixed values is simply "mixed":
>>> infer_dtype(['a string', pd.Timedelta(10)])
'mixed'
A mix of floats and integers is ''mixed-integer-float'':
>>> infer_dtype([3.141, 99])
'mixed-integer-float'
To make the function you describe in your question, one approach could be to create a function which catches the relevant mixed cases:
def is_mixed(col):
return infer_dtype(col) in ['mixed', 'mixed-integer']
Then you have:
>>> df.apply(is_mixed)
A True
B False
C False
dtype: bool
Here is an approach that uses the fact that in Python3 different types cannot be compared. The idea is to run max
over the array which being a builtin should be reasonably fast. And it does short-cicuit.
def ismixed(a):
try:
max(a)
return False
except TypeError as e: # we take this to imply mixed type
msg, fst, and_, snd = str(e).rsplit(' ', 3)
assert msg=="'>' not supported between instances of"
assert and_=="and"
assert fst!=snd
return True
except ValueError as e: # catch empty arrays
assert str(e)=="max() arg is an empty sequence"
return False
It doesn't catch mixed numeric types, though. Also, objects that just do not support comparison may trip this up.
But it's reasonably fast. If we strip away all pandas
overhead:
v = df.values
list(map(is_mixed, v.T))
# [True, False, False]
timeit(lambda: list(map(ismixed, v.T)), number=1000)
# 0.008936170022934675
For comparison
timeit(lambda: list(map(infer_dtype, v.T)), number=1000)
# 0.02499613002873957
Not sure how you need the result, but you can map
the type
to df.values.ravel()
and create a dictionary of the name of the column link to the comparison of the len
of a set
superior to 1 for each slice of the l
such as:
l = list(map(type, df.values.ravel()))
print ({df.columns[i]:len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])})
{'A': True, 'B': False, 'C': False}
Timing:
%timeit df.applymap(type).nunique() > 1
#3.25 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
l = list(map(type, df.values.ravel()))
{df.columns[i]:len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])}
#100 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
EDIT for larger dataframe, the improve in time is less interesting though:
dfl = pd.concat([df]*100000,ignore_index=True)
%timeit dfl.applymap(type).nunique() > 1
#519 ms ± 61.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
l = list(map(type, dfl.values.ravel()))
{dfl.columns[i]:len(set(l[i::dfl.shape[1]])) > 1 for i in range(dfl.shape[1])}
#254 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A bit faster solution on the same idea:
%timeit { col: len(set(map(type, dfl[col])))>1 for col in dfl.columns}
#124 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With