pandas dataframe index match

Tags: python, indexing, pandas, dataframe

I'm wondering if there is a more efficient way to do the "index & match" type of lookup that is popular in Excel. For example, given two pandas DataFrames, update df_1 with information found in df_2:

import pandas as pd

df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})    
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})

I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".

Below is the most efficient method that I could come up with. There has to be a better way!

name_a = []
name_b = []
for i in range(len(df_1)):

    name_a.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_b.iloc[i]].index[0]])

df_1['name_a'] = name_a
df_1['name_b'] = name_b

Resulting in:

>>> df_1.head()
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c


asked Jun 02 '17 00:06

A. Martin

2 Answers

High Level

Create a dictionary to use in a replace
replace, rename columns, and join

m = dict(zip(
    df_2.num.values.tolist(),
    df_2.name.values.tolist()
))

df_1.join(
    df_1.replace(m).rename(
        columns=lambda x: x.replace('num', 'name')
    )
)

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

Breakdown

replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2. As a matter of fact, we could have used a pd.Series instead. I chose to build it with dict and zip because, in my timings below, that was the fastest option.

Building m

Option 1

m = df_2.set_index('num').name

Option 2

m = df_2.set_index('num').name.to_dict()

Option 3

m = dict(zip(df_2.num, df_2.name))

Option 4 (My Choice)

m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

m build times

%timeit df_2.set_index('num').name
1000 loops, best of 3: 325 µs per loop

%timeit df_2.set_index('num').name.to_dict()
1000 loops, best of 3: 376 µs per loop

%timeit dict(zip(df_2.num, df_2.name))
10000 loops, best of 3: 32.9 µs per loop

%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
100000 loops, best of 3: 10.4 µs per loop
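
The timings above are for the five-row example. As a rough sketch of how one might repeat the comparison at something closer to the question's ~80,000 rows, the snippet below uses the standard-library timeit module on a synthetic df_2. The row count, the generated names, and the builders dict are assumptions made for illustration; actual numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for the ~80,000-row lookup table from the question.
n = 80_000
df_2 = pd.DataFrame({'num': range(1, n + 1),
                     'name': [f'name_{i}' for i in range(1, n + 1)]})

# The four ways of building the lookup object discussed above.
builders = {
    'Series': lambda: df_2.set_index('num').name,
    'Series.to_dict': lambda: df_2.set_index('num').name.to_dict(),
    'dict(zip(columns))': lambda: dict(zip(df_2.num, df_2.name)),
    'dict(zip(lists))': lambda: dict(zip(df_2.num.values.tolist(),
                                         df_2.name.values.tolist())),
}

for label, build in builders.items():
    # Best of 3 repeats, 5 calls per repeat; report seconds per call.
    best = min(timeit.repeat(build, number=5, repeat=3)) / 5
    print(f'{label:<20} {best:.6f} s per call')
```

Whichever builder wins on your data, all four should describe the same num-to-name mapping, so the choice only affects speed.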

Replacing num

Again, we have choices. Here are a few, along with their timings.

%timeit df_1.replace(m)
%timeit df_1.applymap(lambda x: m.get(x, x))
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()

1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 959 µs per loop
1000 loops, best of 3: 925 µs per loop
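
One more variant, not part of the timings above and offered only as a sketch: map each column with Series.map and the same m.get(x, x) fallback. This sidesteps applymap, which recent pandas releases deprecate in favour of DataFrame.map. The name replaced is just for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})

# Same lookup dictionary as in the answer.
m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

# Apply Series.map to every column; m.get(x, x) returns the original
# value whenever x has no entry in m, so unmatched cells pass through.
replaced = df_1.apply(lambda col: col.map(lambda x: m.get(x, x)))
print(replaced)
```

I have not timed this against the three options above, so treat it as an alternative spelling rather than a faster one.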

I choose...

df_1.replace(m)

  num_a num_b
0     a     b
1     b     d
2     c     a
3     d     b
4     e     c

Rename columns

df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))

  name_a name_b   <-- note the column name change
0      a      b
1      b      d
2      c      a
3      d      b
4      e      c

Join

df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c
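
The rename step matters because join refuses to combine frames whose column names overlap unless a suffix is supplied. Here is a minimal sketch of that behaviour using the example data from the question; the names replaced, renamed, result and overlap_error are only for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
m = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

replaced = df_1.replace(m)

# Joining without renaming: both frames have 'num_a' and 'num_b',
# so pandas raises a ValueError about overlapping columns.
overlap_error = None
try:
    df_1.join(replaced)
except ValueError as err:
    overlap_error = err
    print('join failed:', err)

# After renaming num_* to name_*, the column names are distinct and
# the join lines the rows up on the shared index.
renamed = replaced.rename(columns=lambda x: x.replace('num', 'name'))
result = df_1.join(renamed)
print(result.columns.tolist())
```

Passing lsuffix or rsuffix to join would also get past the error, but renaming gives the name_a / name_b columns the question asked for.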

answered Oct 15 '22 10:10

piRSquared

I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is essentially a VLOOKUP, which you can reproduce in pandas with Series.map.

name_map = dict(df_2.set_index('num').name)

df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)

df_1

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
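
One thing to keep in mind with this approach: when map is given a plain dict, any value without a matching key comes back as NaN instead of raising. The sketch below changes one value in df_1 to 99 (an assumption, purely to force a miss) and fills the gap with a placeholder; the 'unknown' label and the loop over column names are illustrative choices, not part of the answer above.

```python
import pandas as pd

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 99]})   # 99 has no match in df_2
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})

name_map = dict(zip(df_2['num'], df_2['name']))

# Build name_a / name_b from num_a / num_b with one lookup per column.
for col in ['num_a', 'num_b']:
    df_1[col.replace('num', 'name')] = df_1[col].map(name_map)

# Rows whose number was not found in name_map hold NaN at this point.
missing = df_1['name_b'].isna()
print(df_1[missing])

# Optionally replace the NaN with an explicit placeholder.
df_1['name_b'] = df_1['name_b'].fillna('unknown')
print(df_1)
```

Checking isna() after the lookup is a cheap way to spot keys in df_1 that are absent from df_2 before they cause trouble downstream.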

answered Oct 15 '22 10:10

T. Ray
