
Is there an efficient method of checking whether a column has mixed dtypes?

Consider

import numpy as np
import pandas as pd

np.random.seed(0)
s1 = pd.Series([1, 2, 'a', 'b', [1, 2, 3]])
s2 = np.random.randn(len(s1))
s3 = np.random.choice(list('abcd'), len(s1))


df = pd.DataFrame({'A': s1, 'B': s2, 'C': s3})
df
           A         B  C
0          1  1.764052  a
1          2  0.400157  d
2          a  0.978738  c
3          b  2.240893  a
4  [1, 2, 3]  1.867558  a

Column "A" has mixed data types. I would like to come up with a really quick way of determining this. It would not be as simple as checking whether type == object, because that would identify "C" as a false positive.
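To make the false positive concrete: both "A" and "C" report dtype object, so dtypes alone cannot distinguish them (a self-contained sketch reproducing the frame above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'A': pd.Series([1, 2, 'a', 'b', [1, 2, 3]]),  # genuinely mixed types
    'B': np.random.randn(5),                      # homogeneous floats
    'C': np.random.choice(list('abcd'), 5),       # homogeneous strings
})

# "A" and "C" are indistinguishable by dtype alone:
print(df.dtypes)  # A: object, B: float64, C: object
```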

I can think of doing this with

df.applymap(type).nunique() > 1

A     True
B    False
C    False
dtype: bool

But calling type through applymap is pretty slow, especially for larger frames.

%timeit df.applymap(type).nunique() > 1
3.95 ms ± 88 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Can we do better (perhaps with NumPy)? I can accept "No" if your argument is convincing enough. :-)

asked Dec 12 '18 by cs95

3 Answers

In pandas there's infer_dtype() which might be helpful here.

Written in Cython (code link), it returns a string summarising the values in the passed object. It's used a lot in pandas' internals, so we might reasonably expect that it has been designed with efficiency in mind.

>>> from pandas.api.types import infer_dtype

Now, column A is a mix of integers and some other types:

>>> infer_dtype(df.A)
'mixed-integer'

Column B's values are all of floating type:

>>> infer_dtype(df.B)
'floating'

Column C contains strings:

>>> infer_dtype(df.C)
'string'

The general "catchall" type for mixed values is simply "mixed":

>>> infer_dtype(['a string', pd.Timedelta(10)])
'mixed'

A mix of floats and integers is 'mixed-integer-float':

>>> infer_dtype([3.141, 99])
'mixed-integer-float'

To build the check you describe in your question, one approach is to write a function that catches the relevant mixed cases:

def is_mixed(col):
    return infer_dtype(col) in ['mixed', 'mixed-integer']

Then you have:

>>> df.apply(is_mixed)
A     True
B    False
C    False
dtype: bool
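A quick sanity check of is_mixed on plain lists. Note that a mix of floats and integers infers as 'mixed-integer-float', which this version deliberately does not flag; broaden the list of accepted strings if you want it to:

```python
from pandas.api.types import infer_dtype

def is_mixed(col):
    # flag only the genuinely heterogeneous cases
    return infer_dtype(col) in ['mixed', 'mixed-integer']

print(is_mixed([1, 2, 'a']))      # True  ('mixed-integer')
print(is_mixed(['a', 'b', 'c']))  # False ('string')
print(is_mixed([1.5, 2, 3]))      # False ('mixed-integer-float')
```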
answered Nov 16 '22 by Alex Riley

Here is an approach that uses the fact that in Python 3 values of different types generally cannot be ordered. The idea is to run max over the array; being a builtin, it should be reasonably fast, and it short-circuits on the first failing comparison.

def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e: # we take this to imply mixed type
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg=="'>' not supported between instances of"
        assert and_=="and"
        assert fst!=snd
        return True
    except ValueError as e: # catch empty arrays
        assert str(e)=="max() arg is an empty sequence"
        return False

It doesn't catch mixed numeric types, though. Also, objects that just do not support comparison may trip this up.
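A quick demonstration of both the happy path and the numeric caveat (a sketch; note that same-type but unorderable objects, e.g. two dicts, would fail the fst != snd assertion):

```python
def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e:  # we take this to imply mixed types
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg == "'>' not supported between instances of"
        assert and_ == "and"
        assert fst != snd
        return True
    except ValueError as e:  # catch empty sequences
        assert str(e) == "max() arg is an empty sequence"
        return False

print(ismixed([1, 'a', 2]))  # True  - int vs str cannot be ordered
print(ismixed([1, 2.5, 3]))  # False - ints and floats compare fine
print(ismixed([]))           # False - empty input
```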

But it's reasonably fast. If we strip away all pandas overhead:

v = df.values

list(map(ismixed, v.T))
# [True, False, False]
timeit(lambda: list(map(ismixed, v.T)), number=1000)
# 0.008936170022934675

For comparison

timeit(lambda: list(map(infer_dtype, v.T)), number=1000)
# 0.02499613002873957
answered Nov 16 '22 by Paul Panzer


Not sure what form you need the result in, but you can map type over df.values.ravel() and build a dictionary mapping each column name to whether the set of types in its slice of l contains more than one element, such as:

l = list(map(type, df.values.ravel()))
print ({df.columns[i]:len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])})
{'A': True, 'B': False, 'C': False}
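The slicing l[i::df.shape[1]] works because ravel flattens the array in row-major (C) order, so the values of column i sit every shape[1] positions apart. A minimal NumPy illustration:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

flat = a.ravel()  # row-major order: [1, 2, 3, 4, 5, 6]
ncols = a.shape[1]

# stepping by the column count recovers each column
print(list(flat[0::ncols]))  # column 0 -> [1, 4]
print(list(flat[1::ncols]))  # column 1 -> [2, 5]
```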

Timing:

%timeit df.applymap(type).nunique() > 1
#3.25 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
l = list(map(type, df.values.ravel()))
{df.columns[i]:len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])}
#100 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

EDIT: for a larger dataframe, the improvement in time is less impressive, though:

dfl = pd.concat([df]*100000,ignore_index=True)

%timeit dfl.applymap(type).nunique() > 1
#519 ms ± 61.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
l = list(map(type, dfl.values.ravel()))
{dfl.columns[i]:len(set(l[i::dfl.shape[1]])) > 1 for i in range(dfl.shape[1])}
#254 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A slightly faster solution based on the same idea:

%timeit { col: len(set(map(type, dfl[col])))>1 for col in dfl.columns}
#124 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Nov 16 '22 by Ben.T