
How does pandas handle a comparison between a column of dtype "object" and an integer?

My question is about the rule pandas uses when a column of dtype "object" is compared with an integer. Here is my session:

In [334]: df
Out[334]: 
     c1    c2        c3  c4
id1   1    li -0.367860   5
id2   2  zhao -0.596926   5
id3   3   sun  0.493806   5
id4   4  wang -0.311407   5
id5   5  wang  0.253646   5

In [335]: df < 2
Out[335]: 
        c1    c2    c3     c4
id1   True  True  True  False
id2  False  True  True  False
id3  False  True  True  False
id4  False  True  True  False
id5  False  True  True  False

In [336]: df.dtypes
Out[336]: 
c1      int64
c2     object
c3    float64
c4      int64
dtype: object

Why does the "c2" column come out as True in every row?

P.S. I also tried:

In [333]: np.less(np.array(["s","b"]),2)
Out[333]: NotImplemented

BO.LI asked Aug 18 '18 12:08


1 Answer

For DataFrames, comparison with a scalar always returns a DataFrame having all Boolean columns.

I don't think it's documented anywhere officially, but there's a comment in the source code (see below) confirming the intended behaviour:

[for] straight boolean comparisons [between a DataFrame and a scalar] we want to allow all columns (regardless of dtype to pass thru) See #4537 for discussion.

In practice, this means that every comparison in every column must produce either True or False, so an invalid comparison (such as 'li' < 2) has to fall back to one of these two Boolean values.

Put simply, the pandas developers decided that it should default to True.

Issue #4537 discusses this behaviour, including arguments for defaulting to False instead or for restricting the comparison to columns with compatible types, but the ticket was closed and no code was changed.
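One of the alternatives raised in that discussion, restricting the comparison to columns with compatible types, can be done by hand. The following is a minimal sketch, not what pandas does internally: it rebuilds the frame from the question (values copied from the output shown there) and uses select_dtypes to drop the non-numeric column before comparing. The variable names numeric_only and result are just illustrative.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame(
    {
        "c1": [1, 2, 3, 4, 5],
        "c2": ["li", "zhao", "sun", "wang", "wang"],
        "c3": [-0.367860, -0.596926, 0.493806, -0.311407, 0.253646],
        "c4": [5, 5, 5, 5, 5],
    },
    index=["id1", "id2", "id3", "id4", "id5"],
)

# Keep only the numeric columns, so the string column "c2"
# never takes part in the comparison.
numeric_only = df.select_dtypes(include="number")

# Element-wise comparison on c1, c3 and c4 only.
result = numeric_only < 2
print(result)
```

This avoids relying on whatever default pandas applies to an invalid comparison, at the cost of losing "c2" from the result.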

If you're interested, you can see where the default value is used for invalid comparisons in an internal method found in ops.py:

def _comp_method_FRAME(cls, func, special):
    str_rep = _get_opstr(func, cls)
    op_name = _get_op_name(func, special)

    @Appender('Wrapper for comparison method {name}'.format(name=op_name))
    def f(self, other):
        if isinstance(other, ABCDataFrame):
            # Another DataFrame
            if not self._indexed_same(other):
                raise ValueError('Can only compare identically-labeled '
                                 'DataFrame objects')
            return self._compare_frame(other, func, str_rep)

        elif isinstance(other, ABCSeries):
            return _combine_series_frame(self, other, func,
                                         fill_value=None, axis=None,
                                         level=None, try_cast=False)
        else:

            # straight boolean comparisons we want to allow all columns
            # (regardless of dtype to pass thru) See #4537 for discussion.
            res = self._combine_const(other, func,
                                      errors='ignore',
                                      try_cast=False)
            return res.fillna(True).astype(bool)

    f.__name__ = op_name
    return f

The else block is the one we're interested in for the scalar case.

Note the errors='ignore' argument, meaning an invalid comparison will return NaN (instead of raising an error). The res.fillna(True) call then replaces these failed comparisons with True, and astype(bool) makes every column Boolean.
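The same two-step idea (turn a failed comparison into NaN, then fill NaN with True) can be imitated in plain Python. This is only a sketch of the logic described above, not pandas code: compare_or_nan and fill_failed_with_true are hypothetical helper names, and it assumes Python 3, where comparing a str with an int raises TypeError.

```python
import math


def compare_or_nan(value, scalar):
    """Return value < scalar, or NaN when the two cannot be compared."""
    try:
        return value < scalar
    except TypeError:
        # Python 3 raises TypeError for comparisons such as 'li' < 2.
        return math.nan


def fill_failed_with_true(results):
    """Replace NaN placeholders with True and cast everything else to bool."""
    return [
        True if isinstance(r, float) and math.isnan(r) else bool(r)
        for r in results
    ]


column = ["li", "zhao", 1, 5]
raw = [compare_or_nan(v, 2) for v in column]   # [nan, nan, True, False]
final = fill_failed_with_true(raw)
print(final)  # [True, True, True, False]
```

The strings end up as True purely because of the fill value, which is why the whole "c2" column in the question is True.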

Alex Riley answered Nov 02 '22 08:11