
Pandas bug: __setitem__() doesn't recognize dictionary values as a list of column names

Edit: It looks like this is a potential bug in Pandas. Check out this GitHub issue, helpfully raised by @NicMoetsch, which notes that the unexpected behavior when assigning with dictionary values has to do with a difference between the DataFrame's __setitem__() and __getitem__().


Earlier on in my code I rename some columns with a dictionary:

cols_dict = {
     'Long_column_Name': 'first_column',
     'Other_Long_Column_Name': 'second_column',
     'AnotherLongColName': 'third_column'
}
for key, val in cols_dict.items():
    df.rename(columns={key: val}, inplace=True)

(I know the loop isn't necessary here — in my actual code I'm having to search the columns of a dataframe in a list of dataframes and get a substring match for the dictionary key.)
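As a rough sketch of the substring-matching rename described above (the helper name build_rename_map and the sample column names are hypothetical, not taken from the actual code), the mapping for one dataframe's columns could be built like this:

```python
cols_dict = {
    "Long_column_Name": "first_column",
    "Other_Long_Column_Name": "second_column",
    "AnotherLongColName": "third_column",
}


def build_rename_map(columns, mapping):
    """Hypothetical helper: map each column to the new name of the
    first dictionary key that occurs as a substring of the column name."""
    rename_map = {}
    for col in columns:
        for key, new_name in mapping.items():
            if key in str(col):  # case-sensitive substring match
                rename_map[col] = new_name
                break
    return rename_map


# Made-up column names standing in for one dataframe's columns.
columns = ["2019 Long_column_Name", "Other_Long_Column_Name (raw)", "unrelated"]
rename_map = build_rename_map(columns, cols_dict)

# With a real dataframe this would be applied in a single call:
#     df = df.rename(columns=build_rename_map(df.columns, cols_dict))
```

Columns without a matching key are left out of the mapping, so rename() would leave them untouched.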

Later on I do some cleanup with applymap(), indexing with the dictionary values, and it works fine:

pibs[cols_dict.values()].applymap(
    lambda x: np.nan if ':' in str(x) else x
)

but when I try to assign the slice back to itself, I get a KeyError (full error message here):

pibs[cols_dict.values()] = pibs[cols_dict.values()].applymap(
    lambda x: np.nan if ':' in str(x) else x
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: dict_values(['first_column', 'second_column', 'third_column'])

The code runs fine if I convert the dictionary values to a list

pibs[list(cols_dict.values())] = ...

so I guess I'm just wondering why I'm able to slice with dictionary values and run applymap() on it, but I'm not able to slice with dictionary values when I turn around and try to assign the result back to the dataframe.

Put simply: why does pandas recognize cols_dict.values() as a list of column names when it's used for indexing, but not when it's used for indexing for assignment?
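For reference, a minimal self-contained sketch of the list() workaround mentioned above. The sample data is made up, and a per-column Series.map() is used in place of applymap() only as an assumption, so that the sketch does not depend on which pandas versions still provide applymap():

```python
import numpy as np
import pandas as pd

cols_dict = {
    "Long_column_Name": "first_column",
    "Other_Long_Column_Name": "second_column",
    "AnotherLongColName": "third_column",
}

# Made-up data; some cells contain ':' and should become NaN.
pibs = pd.DataFrame({
    "first_column": ["1", "12:30", "3"],
    "second_column": ["a", "b", "c:d"],
    "third_column": [1.5, 2.5, 3.5],
})

# Materialize the dict_values view into a real list before indexing.
cols = list(cols_dict.values())

# Element-wise cleanup, column by column.
cleaned = pibs[cols].apply(
    lambda s: s.map(lambda x: np.nan if ":" in str(x) else x)
)

# Assigning with a list key targets the three existing columns.
pibs[cols] = cleaned
```

After the assignment, pibs still has exactly the three original columns, with NaN wherever a cell contained a colon.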

asked Apr 06 '21 01:04 by semblable



2 Answers

The issue seems to be unrelated to applymap(), as running aneroid's example without applymap():

import pandas as pd

cols_dict = {
     'Long_column_Name': 'first_column',
     'Other_Long_Column_Name': 'second_column',
     'AnotherLongColName': 'third_column'
}

df = pd.DataFrame({'Long_column_Name': range(3),
                   'Other_Long_Column_Name': range(3, 6),
                   'AnotherLongColName': range(15, 10, -2),
})
df.rename(columns=cols_dict, inplace=True)

df[cols_dict.values()] = df[cols_dict.values()]

yields the same error.

Obviously it's not the operation part that doesn't work, but the assignment part, as

df = df[cols_dict.values()]

works fine. Playing around with different DataFrame combinations showed that the 3 in the error message

ValueError: Wrong number of items passed 3, placement implies 1

comes from the number of columns in the value being assigned, not from the key, as trying to assign a four-column DataFrame throws a different error:

df2 = pd.DataFrame({'Long_column_Name': range(3),
                    'Other_Long_Column_Name': range(3, 6),
                    'AnotherLongColName': range(15, 10, -2),
                    'ShtClNm': range(10, 13)})

df[cols_dict.values()] = df2

yields

ValueError: Wrong number of items passed 4, placement implies 1

Thus I tried assigning only one column, so that in theory only 1 item is passed. This worked without throwing an error:

df[cols_dict.values()] = df2['Long_column_Name']

The result, however, is not what was expected:

df
   first_column  second_column  third_column  (first_column, second_column,third_column)
0             0              3            15                                           0
1             1              4            13                                           1
2             2              5            11                                           2

So to me it seems like what is happening is that pandas doesn't recognize the cols_dict.values() that is passed to df[...] = as a list of column names but instead as the name of one new column (first_column, second_column,third_column).

That's why it tries to fill that new column with the values passed for assignment. Since you passed too many (3) columns to assign to the one new column, it broke.

When you use list() in df[list(cols_dict.values())] = it works fine, because it then recognizes that a list of columns is passed.
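The asymmetry described above can be imitated with a toy container. This is only an illustration and not pandas's implementation: ToyFrame and its type checks are made up for the sketch. It relies on one fact about plain Python, namely that a dict_values view is iterable but is not a list, and is hashable, so it can end up being used as a single key:

```python
class ToyFrame:
    """Toy stand-in for a column container; NOT pandas's actual code."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __getitem__(self, key):
        # Lookup: any non-string iterable is treated as several column names.
        if not isinstance(key, str) and hasattr(key, "__iter__"):
            return {name: self.columns[name] for name in key}
        return self.columns[key]

    def __setitem__(self, key, value):
        # Assignment: only a real list is treated as several column names.
        if isinstance(key, list):
            for name in key:
                self.columns[name] = value[name]
        else:
            # Anything else, including a dict_values view, is ONE label.
            self.columns[key] = value


cols_dict = {"a_long": "first", "b_long": "second"}
frame = ToyFrame({"first": [0, 1], "second": [3, 4]})
names = cols_dict.values()

picked = frame[names]              # lookup sees two column names
frame[names] = picked              # assignment adds ONE new entry instead
n_after_view = len(frame.columns)  # two originals plus the new entry

frame[list(names)] = {"first": [9, 9], "second": [8, 8]}  # list key works
```

In this toy model the lookup and the assignment disagree about what the same key means, which mirrors the single extra column that appeared in the output above.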

Diving deeper into pandas DataFrames, I think I've found the issue.

From my understanding, pandas uses __setitem__() for assignment and __getitem__() for look-ups. Both functions make use of convert_to_index_sliceable(), defined here, which returns a slice if whatever you've passed is sliceable and None if it isn't.

Both __getitem__() and __setitem__() first check whether convert_to_index_sliceable() returns None. If it doesn't return None, however, they differ.

__getitem__() converts the indexer to np.intp, which is numpy's indexing data type, before returning the slice, as follows:

        # Do we have a slicer (on rows)?
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            if isinstance(indexer, np.ndarray):
                indexer = lib.maybe_indices_to_slice(
                    indexer.astype(np.intp, copy=False), len(self)
                )
            # either we have a slice or we have a string that can be converted
            #  to a slice for partial-string date indexing
            return self._slice(indexer, axis=0)

__setitem__(), on the other hand, returns right away:

        # see if we can slice the rows
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            # either we have a slice or we have a string that can be converted
            #  to a slice for partial-string date indexing
            return self._setitem_slice(indexer, value)

Assuming that no unnecessary code was added to __getitem__(), I think __setitem__() must be missing that code, since both pre-return comments state the exact same thing as to what indexer could possibly be.

I'm going to raise a GitHub issue asking if that is intended behavior or not.

answered Sep 18 '22 22:09 by Nic Moetsch


Not a direct answer to your question of why you're able to fetch records with the dict.values() slicing but not set with it. However, it probably has to do with indexing, because if I use loc, it works fine.

Let's set it up:

cols_dict = {
     'Long_column_Name': 'first_column',
     'Other_Long_Column_Name': 'second_column',
     'AnotherLongColName': 'third_column'
}

df = pd.DataFrame({'Long_column_Name': range(3),
                   'Other_Long_Column_Name': range(3, 6),
                   'AnotherLongColName': range(15, 10, -2),
})
df.rename(columns=cols_dict, inplace=True)
df
   first_column  second_column  third_column
0             0              3            15
1             1              4            13
2             2              5            11

An applymap to use:

df[cols_dict.values()].applymap(lambda x: -1 if x % 2 == 0 else x ** 2)
   first_column  second_column  third_column
0            -1              9           225
1             1             -1           169
2            -1             25           121

This line throws the error you got:

df[cols_dict.values()] = df[cols_dict.values()].applymap(lambda x: -1 if x % 2 == 0 else x ** 2)
# error thrown

But this works, with df.loc:

df.loc[:, cols_dict.values()] = df[cols_dict.values()].applymap(lambda x: -1 if x % 2 == 0 else x ** 2)
df
   first_column  second_column  third_column
0            -1              9           225
1             1             -1           169
2            -1             25           121

Edit, some partial inference which could be wrong: Btw, the longer error shows what else might have been happening:

KeyError: dict_values(['first_column', 'second_column', 'third_column'])

During handling of the above exception, another exception occurred:
# later:
ValueError: Wrong number of items passed 3, placement implies 1

...which has gone through a section of insert and make_block which leads me to think it was trying to create columns and failed there. And that section was invoked for setitem but not for getitem - so the lookups occurring did not have the same result. I would have instead expected the "setting with copy" error.

answered Sep 16 '22 22:09 by aneroid