Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so it's a bit cumbersome to apply index criteria considering I don't think I need to.
Any thoughts on best-practices around indexing?
Indexing in Python is a way to refer to individual items within an iterable by their position. In other words, you can directly access the elements of your choice within an iterable and perform various operations on them depending on your needs.
Indexes become more important with time series data, and visualisations also need good control over the pandas index. An index is like an address: it is how any data point in a DataFrame or Series can be accessed. Rows and columns both have labels; the row labels are called the index, and for columns the labels are simply the column names.
The DataFrame.index property returns an Index object representing the index of the DataFrame. The syntax is simply DataFrame.index. Since the property returns an Index object, you can iterate over the individual labels using any looping technique in Python.
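As a quick sketch (using a made-up frame), accessing and iterating over the index might look like:

```python
import pandas as pd

# Hypothetical frame with string labels for illustration
df = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])

print(df.index)           # the Index object itself
for label in df.index:    # Index objects are iterable
    print(label, df.loc[label, 'a'])
```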
Numpy goes by rows and columns (rows first, because an element (i, j) of a matrix denotes the i-th row and j-th column), while pandas works based on the columns of a table, inside which you choose elements, i.e. rows. Of course you can work directly on positional indices by using iloc, as you mentioned.
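To make the positional access concrete, a minimal sketch (with made-up data) of iloc, which ignores labels entirely and uses row/column positions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# iloc is purely positional: row 0, column 1
val = df.iloc[0, 1]

# a whole row by position comes back as a Series
first_row = df.iloc[0]
```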
Like a dict, a DataFrame's index is backed by a hash table. Looking up rows based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
For example, consider

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(), 'index': range(10000)})
df_with_index = df.set_index(['index'])
```
Here is how you could look up any row where the `df['index']` column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999.

```python
df[df['index'] == 999]
#           foo  index
# 999  0.375489    999
```
Here is how you could look up any row where the index equals 999. With an index, pandas uses the hash value to find the rows:

```python
df_with_index.loc[999]
# foo        0.375489
# index    999.000000
# Name: 999, dtype: float64
```
Looking up rows by index is much faster than looking up rows by column value:

```python
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop

In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
```
Note however, it takes time to build the index:

```python
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
```
So having the index is only advantageous when you have many lookups of this type to perform.
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as `set_index`, `stack`, `unstack`, `pivot`, `pivot_table`, `melt`, `lreshape`, and `crosstab`, all use or manipulate the index. Sometimes we want the DataFrame in a different shape for presentation purposes, or for `join`, `merge` or `groupby` operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, `join`, `merge` and `groupby` take advantage of fast index lookups when possible.
Time series have `resample`, `asfreq` and `interpolate` methods whose underlying implementations take advantage of fast index lookups too.
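For instance, `resample` only works because the data carries a datetime-like index; a small sketch with made-up hourly data:

```python
import numpy as np
import pandas as pd

# Hypothetical series indexed by a DatetimeIndex
idx = pd.date_range('2024-01-01', periods=6, freq='h')
s = pd.Series(np.arange(6.0), index=idx)

# Downsample into 2-hour bins; the grouping is driven entirely by the index
two_hourly = s.resample('2h').mean()
```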
So in the end, I think the origin of the index's usefulness, and the reason it shows up in so many functions, is its ability to perform fast hash lookups.