Pandas column indexing for searching?

Tags:

In relational database, we can create index on columns to speed up querying and joining on those columns. I want to do them same thing on pandas dataframe. The row index seems not what relational database offers.

The question is: Are columns in pandas indexed for searching by default?

If not, is it possible to index columns manually and how to do it?

Edit: I have read pandas docs and searched everywhere, but no one mentions indexing and searching/merging performance on pandas. Seem no one care about this issue, although it is critical in relational database. Can any one make a statement about indexing and performance on pandas?

Thanks.

297

asked Mar 07 '17 05:03

THN

1 Answers

As mentioned by @pvg - The pandas model is not that of an in memory relational databases. So, it won't help us much if we try to analogize pandas in terms of sql and it's idiosyncracies. Instead, let's look at the problem fundamentally - you're effectively trying to speed up column lookups/ joins.

You can speed up joins considerably by setting the column you wish to join by as the index in both dataframes (left and right dataframes that you wish to join) and then sorting both the indexes.

Here's an example to show you the kind of speed up you can get when joining on sorted indexes:

import pandas as pd
from numpy.random import randint

# Creating DATAFRAME #1
columns1 = ['column_1', 'column_2']
rows_df_1 = []

# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
    row = [randint(0,100) for x in range(0, 2)]
    rows_df_1.append(row)

df1 = pd.DataFrame(rows_df_1)
df1.columns = columns1

print(df1.head())

The first dataframe looks like this:

Out[]:    

column_1  column_2
0        83        66
1        91        12
2        49         0
3        26        75
4        84        60

Let's create the second dataframe:

columns2 = ['column_3', 'column_4']
rows_df_2 = []
# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
    row = [randint(0,100) for x in range(0, 2)]
    rows_df_2.append(row)

df2 = pd.DataFrame(rows_df_1)
df2.columns = columns2

The second dataframe looks like this:

Out[]:    

   column_3  column_4
0        19        26
1        78        44
2        44        43
3        95        47
4        48        59

Now let's say you wish to join these two dataframes on column_1 == column_3

# setting the join columns as indexes for each dataframe
df1 = df1.set_index('column_1')
df2 = df2.set_index('column_3')


# joining
%time
df1.join(df2)

Out[]:
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 46 ms

As you can see, just setting the join columns as dataframe indexes and joining after - takes around 46 milliseconds. Now, let's try joining *after sorting the indexes*

# sorting indexes
df1 = df1.sort_index()
df2 = df2.sort_index()

Out[]:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs

This takes around 9.78 µs, much much faster.

I believe you can apply the same sorting technique to pandas columns - sort the columns lexicographically and modify the dataframe. I haven't tested the code below, but something like this should give you a speedup on column lookups:

import numpy as np
# Lets assume df is a dataframe with thousands of columns
df = read_csv('csv_file.csv')
columns = np.sort(df.columns)

df = df[columns]

Now column lookups should be much faster - would be great if someone could test this out on a dataframe with a thousand of columns

180

answered Sep 19 '22 09:09

Shivam Gaur

Related questions
                            
                                Python module import works for one file, fails for another
                            
                                gdal_calc amin fails when passing more than 23 input files
                            
                                Setting chromedriver proxy auth with Selenium using Python
                            
                                Obtaining a prediction in Keras
                            
                                How to segment blood vessels python opencv
                            
                                Jupyter + rpy2 outputs to command prompt instead of notebook cell
                            
                                How to initialize OpenGL context with PyGame instead of GLUT
                            
                                "Can't initialize character set utf8mb4" with Windows mysql-python
                            
                                PyDev debugging: do not open "_pydev_execfile" at the end
                            
                                Automagically propagating deletion when using a bidirectional association_proxy
                            
                                Python Documentation (:obj:`str`) vs (str)
                            
                                Detect bounced emails in Python smtplib
                            
                                What's the difference between .post() , .create() and perform_create() in views.py and .create() in serializers.py
                            
                                httplib.BadStatusLine: '' on Linux but not Mac
                            
                                Initialize field only once in Python
                            
                                Pandas MultiIndex lookup with Numpy arrays
                            
                                How do you keep table rows together in python-docx?
                            
                                Enable Django admin functionality at frontend with inlines
                            
                                How to install Mayavi Trait backends?
                            
                                Which seeds have to be set where to realize 100% reproducibility of training results in tensorflow?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas column indexing for searching?

Tags:

performance

python

indexing

pandas

mysql

THN

People also ask

1 Answers

Shivam Gaur

Recent Activity

Donate For Us