How did Pandas data frames end up with multiple indexing methods?

Question

This seems like a simple question but I haven't been able to find any historical perspective as to when [], loc and iloc (and formerly ix) were added, and why.

The docs merely say "Object selection has had a number of user-requested additions".

n.caillou · Accepted Answer

After looking into it a bit more, my current understanding is that the culprits are the following choices:

1. The early design choice that "the primary function of indexing with [] is selecting out lower-dimensional slices", i.e. that for dataframes df[colname] should "return the series corresponding to colname" (from https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html). Similarly, in the 2011 issue thread suggested by @nabulator below, Wesm wrote that "The idea behind DataFrame indexing was to make it a dict-like structure, first and foremost, as a collection of Series objects, so df[key] -> Series".

This is in contrast with e.g. NumPy arrays, MATLAB's arrays/matrices and R's dataframes, where [] allows two-dimensional slicing.

2. Emphasis is given to Pandas Indexes (also known in Pandas as "labels"). Indexes in standard Python and most other libraries and programming languages are positional (this more common scheme is known as "integer position" in Pandas), whereas Pandas indexes refer to hashable keys. This is why Pandas' index-related functions raise KeyError, whereas functions that specifically circumvent Pandas' indexes, such as .iloc, raise IndexError. This key-based scheme helps ensure proper alignment of data slices.
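To make the KeyError / IndexError distinction concrete, here is a minimal sketch. The toy frame and the error_name helper are my own illustration (not from the Pandas docs); string row labels are used so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical toy frame: string row labels keep labels and positions distinct.
df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])


def error_name(lookup):
    """Run a lookup and return the name of the exception it raises, or None."""
    try:
        lookup()
    except (KeyError, IndexError) as exc:
        return type(exc).__name__
    return None


label_error = error_name(lambda: df.loc["z"])     # no row is labelled "z"
position_error = error_name(lambda: df.iloc[5])   # only positions 0..2 exist
column_error = error_name(lambda: df["missing"])  # [] looks up a column key

print(label_error, position_error, column_error)
# prints: KeyError IndexError KeyError
```

Both label-based lookups fail the way a dict lookup would, while the positional one fails the way a list lookup would.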

However, by default, row labels are numeric, which causes confusion with positional access. Disambiguation could have happened through the Index class (e.g., hypothetical syntax such as df[ pd.Index([0]), ] or df[ df.index.slice(from=3, to=5), ] would be unambiguous), but given the emphasis on labels it was built directly into the DataFrame class, i.e. df.loc[0] and df.loc[3:5].
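A small sketch of how default numeric labels and positions drift apart, assuming a made-up one-column frame (the names df and big are only for illustration):

```python
import pandas as pd

# Hypothetical frame with the default integer row labels 0..5.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60]})

# Filtering keeps the original labels, so labels no longer match positions.
big = df[df["x"] > 30]                 # remaining rows are labelled 3, 4, 5

first_by_position = big.iloc[0, 0]     # first remaining row, whatever its label
by_label = big.loc[3, "x"]             # the row whose label is 3 (the same row)
label_zero_exists = 0 in big.index     # label 0 was filtered out

# Slices differ too: .loc includes both endpoints, .iloc excludes the stop.
loc_rows = list(df.loc[3:5].index)     # labels 3, 4 and 5
iloc_rows = list(df.iloc[3:5].index)   # positions 3 and 4 only
```

After the filter, big.loc[0] would raise KeyError even though big.iloc[0] works, which is exactly the kind of confusion the numeric default invites.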

TL;DR: The early design choices in Pandas were to not treat data frames explicitly as 2D objects and to avoid position-based indexing. This didn't really stand the test of time; the use of [] has been superseded by the two-dimensional .loc and .iloc, and position-based indexing has crept back to dominance (.iloc, .iat and in [] itself), while the direct use of numeric row labels has become rarer.

Anyway, that's what I've got so far. I would welcome corrections to this perspective where appropriate.

wjandrea · Answer

From a modern perspective, the reason is that they do fundamentally different things:

[] is a convenience function for basic indexing. It's permissive while the other methods are strict. In my experience, the two most common uses for indexing are selecting one or more columns and filtering rows, and this does both with the same simple syntax.

.loc[] indexes by label. E.g. if I have a dataset that covers the Canadian provincial capitals (which go mostly west to east), I can select Edmonton to Toronto just like that:

capitals.loc[:, 'Edmonton':'Toronto']

.iloc[] indexes by position. E.g. if I want to select the first two rows regardless of what their labels are (similar to .head()):

capitals.iloc[:2]

Demo data

import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the values below are reproducible
capitals = pd.DataFrame(
    np.random.random((5, 10)),
    index=pd.date_range('2000', periods=5, freq='D'),
    columns=[
        'Victoria', 'Edmonton', 'Regina', 'Winnipeg', 'Toronto',
        'Quebec', 'Fredericton', 'Halifax', 'Charlottetown', 'St Johns']
)

>>> capitals.loc[:, 'Edmonton':'Toronto']
            Edmonton    Regina  Winnipeg   Toronto
2000-01-01  0.715189  0.602763  0.544883  0.423655
2000-01-02  0.528895  0.568045  0.925597  0.071036
2000-01-03  0.799159  0.461479  0.780529  0.118274
2000-01-04  0.774234  0.456150  0.568434  0.018790
2000-01-05  0.437032  0.697631  0.060225  0.666767

>>> capitals.iloc[:2]
            Victoria  Edmonton  ...  Charlottetown  St Johns
2000-01-01  0.548814  0.715189  ...       0.963663  0.383442
2000-01-02  0.791725  0.528895  ...       0.778157  0.870012

[2 rows x 10 columns]

And [] for reference:

>>> capitals['Victoria']  # Select a column
2000-01-01    0.548814
2000-01-02    0.791725
2000-01-03    0.978618
2000-01-04    0.264556
2000-01-05    0.359508
Freq: D, Name: Victoria, dtype: float64

>>> capitals[capitals['Victoria'] > 0.6]  # Filter rows (using a selected column)
            Victoria  Edmonton  ...  Charlottetown  St Johns
2000-01-02  0.791725  0.528895  ...       0.778157  0.870012
2000-01-03  0.978618  0.799159  ...       0.521848  0.414662

[2 rows x 10 columns]
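As a rough sketch of what "permissive" means for [], the snippet below uses a separate, made-up frame (not the demo data) to show the key types it accepts and the stricter .loc / .iloc spellings of the two row selections:

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the demo data above.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

one_column = df["a"]           # string key -> a single column, as a Series
two_columns = df[["a", "b"]]   # list of keys -> several columns, as a DataFrame
masked_rows = df[df["a"] > 2]  # boolean Series -> filters rows
sliced_rows = df[1:3]          # slice -> selects rows, not columns

# The explicit equivalents spell out which axis and which scheme is meant.
assert masked_rows.equals(df.loc[df["a"] > 2])
assert sliced_rows.equals(df.iloc[1:3])
```

So the same brackets act on columns or on rows depending on the type of the key, which is convenient interactively but is also why the stricter accessors exist.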
