
Pandas DataFrame: complete spec for __getitem__()? [closed]

Short version

For pandas DataFrame.__getitem__(), what inputs (input types, really) are allowed, and what does the method return for each?

Details

Description of problem

I would like to write code that makes full use of DataFrame[], that is, DataFrame.__getitem__(). To that end, I would like information on its inputs and return values at the level of detail found on the API page, which does not cover this method.

What has been done so far to solve it

I looked for a complete spec for that function on the pandas API page. Though many other methods are documented there, DataFrame.__getitem__() is not.

I also looked at the tutorial, but I don't believe that's attempting to be exhaustive.

I did look at the source code for DataFrame.__getitem__() (a first pass at this is described in my own answer below). It's evident that a variety of quite different types can be accepted as input, but reverse engineering the code to determine what happens when each of those types is passed surely can't be the intended way to master this method.

Additional background

pandas is one of the most important libraries behind Python's role in science and statistics, DataFrame is arguably the most central object in pandas, and the [] operator is arguably the most central operation on DataFrame. Hence, answering the question I have posted here has high pedagogical value, beyond its utility to me.

gwideman asked Jan 18 '15 12:01




1 Answer

Now that I look at it, I suspect part of the reason this function is undocumented is the lack of doc comments in the source. In case nobody comes up with anything more user-friendly, here's the actual DataFrame.__getitem__() method:

def __getitem__(self, key):

    # shortcut if we are an actual column
    is_mi_columns = isinstance(self.columns, MultiIndex)
    try:
        if key in self.columns and not is_mi_columns:
            return self._getitem_column(key)
    except:
        pass

    # see if we can slice the rows
    indexer = _convert_to_index_sliceable(self, key)
    if indexer is not None:
        return self._getitem_slice(indexer)

    if isinstance(key, (Series, np.ndarray, list)):
        # either boolean or fancy integer index
        return self._getitem_array(key)
    elif isinstance(key, DataFrame):
        return self._getitem_frame(key)
    elif is_mi_columns:
        return self._getitem_multilevel(key)
    else:
        return self._getitem_column(key)

... which at least gives a top-level breakdown of the kinds of key (index) that DataFrame[] accepts.
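To make that breakdown more concrete, here is a minimal sketch of what each kind of key looks like from the caller's side. It assumes a reasonably recent pandas release. The frames `df` and `mi` and their labels are made up for illustration, and the pairing of each case with a branch of the method above is my reading of the code, not an official spec.

```python
import pandas as pd

# Hypothetical example frame; labels chosen only for illustration.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

# 1. A single column label returns that column as a Series
#    (the "shortcut if we are an actual column" path).
col = df["a"]

# 2. A slice selects rows and returns a DataFrame
#    (the "see if we can slice the rows" path). An integer slice
#    on this string index is treated positionally.
rows = df[1:3]

# 3. A list of labels selects columns and returns a DataFrame.
cols = df[["b", "a"]]

# 4. A boolean Series (or array or list) selects the rows where it is True.
masked_rows = df[df["a"] > 2]

# 5. A boolean DataFrame keeps the original shape and puts NaN
#    wherever the mask is False.
masked_cells = df[df > 2]

# 6. With MultiIndex columns, a top-level label returns the
#    sub-frame of columns beneath it.
mi = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_tuples([("x", "p"), ("x", "q"), ("y", "p")]),
)
sub = mi["x"]
```

Note that a scalar or a list is interpreted as column labels, while a slice or a boolean mask acts on rows. That asymmetry is a large part of why a written spec for [] would be useful.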

gwideman answered Sep 19 '22 08:09
