Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so it's a bit cumbersome to apply index criteria considering I don't think I need to.
Any thoughts on best-practices around indexing?
Indexing in Python is a way to refer to individual items within an iterable by their position. In other words, you can directly access the elements of your choice within an iterable and perform various operations on them depending on your needs.
Indexes become more important with time series data, and visualisations also need good control over the pandas index. An index is like an address: it is how any data point in a DataFrame or Series can be accessed. Rows and columns both have labels; the row labels are called the index, and for columns the labels are simply the column names.
The DataFrame.index property returns an Index object representing the index of the DataFrame. The syntax is simply DataFrame.index. Since the property returns an Index object, you can iterate over the individual labels using any looping technique in Python.
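As a quick sketch (using a made-up frame), accessing and iterating over the index might look like:

```python
import pandas as pd

# Hypothetical frame with string labels for illustration
df = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])

print(df.index)           # the Index object itself
for label in df.index:    # Index objects are iterable
    print(label, df.loc[label, 'a'])
```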
Numpy goes by rows and columns (rows first, because an element (i, j) of a matrix denotes the i-th row and j-th column), while pandas works based on the columns of a table, inside which you choose elements, i.e. rows. Of course you can work directly on positional indices by using iloc, as you mentioned.
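To make the positional access concrete, a minimal sketch (with made-up data) of iloc, which ignores labels entirely and uses row/column positions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# iloc is purely positional: row 0, column 1
val = df.iloc[0, 1]

# a whole row by position comes back as a Series
first_row = df.iloc[0]
```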
Like a dict, a DataFrame's index is backed by a hash table. Looking up rows based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
For example, consider

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(), 'index': range(10000)})
df_with_index = df.set_index(['index'])
```
Here is how you could look up any row where the `df['index']` column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999.

```python
df[df['index'] == 999]
#           foo  index
# 999  0.375489    999
```
Here is how you could look up any row where the index equals 999. With an index, pandas uses the hash value to find the rows:

```python
df_with_index.loc[999]
# foo        0.375489
# index    999.000000
# Name: 999, dtype: float64
```
Looking up rows by index is much faster than looking up rows by column value:

```python
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop

In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
```
Note however, it takes time to build the index:

```python
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
```
So having the index is only advantageous when you have many lookups of this type to perform.
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as `set_index`, `stack`, `unstack`, `pivot`, `pivot_table`, `melt`, `lreshape`, and `crosstab`, all use or manipulate the index. Sometimes we want the DataFrame in a different shape for presentation purposes, or for `join`, `merge` or `groupby` operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, `join`, `merge` and `groupby` take advantage of fast index lookups when possible.
Time series have `resample`, `asfreq` and `interpolate` methods whose underlying implementations take advantage of fast index lookups too.
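For instance, `resample` only works because the data carries a datetime-like index; a small sketch with made-up hourly data:

```python
import numpy as np
import pandas as pd

# Hypothetical series indexed by a DatetimeIndex
idx = pd.date_range('2024-01-01', periods=6, freq='h')
s = pd.Series(np.arange(6.0), index=idx)

# Downsample into 2-hour bins; the grouping is driven entirely by the index
two_hourly = s.resample('2h').mean()
```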
So in the end, I think the origin of the index's usefulness, and the reason it shows up in so many functions, is its ability to perform fast hash lookups.