In relational database, we can create index on columns to speed up querying and joining on those columns. I want to do them same thing on pandas dataframe. The row index seems not what relational database offers.
The question is: Are columns in pandas indexed for searching by default?
If not, is it possible to index columns manually and how to do it?
Edit: I have read pandas docs and searched everywhere, but no one mentions indexing and searching/merging performance on pandas. Seem no one care about this issue, although it is critical in relational database. Can any one make a statement about indexing and performance on pandas?
Thanks.
In order to set index to column in pandas DataFrame use reset_index() method. By using this you can also set single, multiple indexes to a column. If you are not aware by default, pandas adds an index to each row of the pandas DataFrame.
Boolean indexing is defined as a very important feature of numpy, which is frequently used in pandas. Its main task is to use the actual values of the data in the DataFrame. We can filter the data in the boolean indexing in different ways, which are as follows: Access the DataFrame with a boolean index.
In Pandas, DataFrame. loc[] property is used to get a specific cell value by row & label name(column name).
As mentioned by @pvg - The pandas model is not that of an in memory relational databases. So, it won't help us much if we try to analogize pandas in terms of sql and it's idiosyncracies. Instead, let's look at the problem fundamentally - you're effectively trying to speed up column lookups/ joins.
You can speed up joins considerably by setting the column you wish to join by as the index in both dataframes (left and right dataframes that you wish to join) and then sorting both the indexes.
Here's an example to show you the kind of speed up you can get when joining on sorted indexes:
import pandas as pd
from numpy.random import randint
# Creating DATAFRAME #1
columns1 = ['column_1', 'column_2']
rows_df_1 = []
# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
row = [randint(0,100) for x in range(0, 2)]
rows_df_1.append(row)
df1 = pd.DataFrame(rows_df_1)
df1.columns = columns1
print(df1.head())
The first dataframe looks like this:
Out[]:
column_1 column_2
0 83 66
1 91 12
2 49 0
3 26 75
4 84 60
Let's create the second dataframe:
columns2 = ['column_3', 'column_4']
rows_df_2 = []
# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
row = [randint(0,100) for x in range(0, 2)]
rows_df_2.append(row)
df2 = pd.DataFrame(rows_df_1)
df2.columns = columns2
The second dataframe looks like this:
Out[]:
column_3 column_4
0 19 26
1 78 44
2 44 43
3 95 47
4 48 59
Now let's say you wish to join these two dataframes on column_1 == column_3
# setting the join columns as indexes for each dataframe
df1 = df1.set_index('column_1')
df2 = df2.set_index('column_3')
# joining
%time
df1.join(df2)
Out[]:
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 46 ms
As you can see, just setting the join columns as dataframe indexes and joining after - takes around 46 milliseconds. Now, let's try joining *after sorting the indexes*
# sorting indexes
df1 = df1.sort_index()
df2 = df2.sort_index()
Out[]:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs
This takes around 9.78 µs, much much faster.
I believe you can apply the same sorting technique to pandas columns - sort the columns lexicographically and modify the dataframe. I haven't tested the code below, but something like this should give you a speedup on column lookups:
import numpy as np
# Lets assume df is a dataframe with thousands of columns
df = read_csv('csv_file.csv')
columns = np.sort(df.columns)
df = df[columns]
Now column lookups should be much faster - would be great if someone could test this out on a dataframe with a thousand of columns
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With