
Pandas populate new dataframe column based on matching columns in another dataframe

Tags: python, merge, pandas, populate

I have a df holding my main data, with one million rows and 30 columns. I want to add another column to df called category. The category is a column in df2, which contains around 700 rows and two other columns that match two columns in df.

I begin by setting an index in df2 and df that matches between the frames; however, some of the index values in df2 don't exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some of the AUTHOR_NAME values in df don't exist in df2, and vice versa.

What I want is: when the index in df matches the index in df2 and title in df matches title in df2, add category to df; otherwise put NaN in category.

Example data:

    df2
                AUTHOR_NAME    CATEGORY
    Index
    Pub1        author1        main
    Pub2        author1        main
    Pub3        author1        main
    Pub1        author2        sub
    Pub3        author2        sub
    Pub2        author4        sub

    df
                AUTHOR_NAME    ...n amount of other columns
    Index
    Pub1        author1
    Pub2        author1
    Pub1        author2
    Pub1        author3
    Pub2        author4

    expected_result
                AUTHOR_NAME    CATEGORY    ...n amount of other columns
    Index
    Pub1        author1        main
    Pub2        author1        main
    Pub1        author2        sub
    Pub1        author3        NaN
    Pub2        author4        sub

If I use df2.merge(df,left_index=True,right_index=True,how='left', on=['AUTHOR_NAME']) my df becomes three times bigger than it is supposed to be.

So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and return its CATEGORY values to df when certain conditions are met.

    def calculate_category(df2, d):
        category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
        return str(category_row['CATEGORY'].iat[0])

    df.apply(lambda d: calculate_category(df2, d), axis=1)

However, this throws an error:

    IndexError: ('index out of bounds', u'occurred at index 7614')
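The error can be reproduced on the example data. For a row of df that has no counterpart in df2, the boolean mask selects zero rows, and .iat[0] has nothing to return. A minimal sketch, assuming the example df2 is rebuilt by hand with "Index" as a regular column (which is what the lookup function expects); the names mask and raised are illustrative only:

```python
import pandas as pd

# Example df2 rebuilt by hand; "Index" is a regular column here (assumption).
df2 = pd.DataFrame({
    'Index': ['Pub1', 'Pub2', 'Pub3', 'Pub1', 'Pub3', 'Pub2'],
    'AUTHOR_NAME': ['author1', 'author1', 'author1',
                    'author2', 'author2', 'author4'],
    'CATEGORY': ['main', 'main', 'main', 'sub', 'sub', 'sub'],
})

# (Pub1, author3) occurs in df but not in df2, so nothing is selected.
mask = (df2['Index'] == 'Pub1') & (df2['AUTHOR_NAME'] == 'author3')
category_row = df2[mask]
print(len(category_row))

try:
    category_row['CATEGORY'].iat[0]
    raised = False
except IndexError:
    # An empty Series has no position 0 to read.
    raised = True
print(raised)
```

Any row-by-row lookup of this kind would need a fallback for the no-match case, and it would also scan df2 once per row of df.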


asked Oct 02 '16 11:10

user3471881

1 Answer

Consider the following dataframes df and df2:

    import pandas as pd

    df = pd.DataFrame(dict(
        AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
        title=list('zyxwvutsrqponml')
    ))

    df2 = pd.DataFrame(dict(
        AUTHOR_NAME=list('AABCCEGG'),
        title=list('zwvtrpml'),
        CATEGORY=list('11223344')
    ))

Option 1: merge

    df.merge(df2, how='left')
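Without an on argument, merge uses every column name the two frames share as a key. If the frames might share other columns, or if df2 might contain the same key pair more than once (which would multiply rows in the result), the call can be made more defensive. A minimal sketch, assuming a pandas version in which merge accepts the validate argument; the name merged is illustrative:

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

# Name the keys explicitly instead of relying on the shared column names.
# validate='many_to_one' raises a MergeError if a key pair repeats in df2,
# so the result cannot silently grow beyond len(df).
merged = df.merge(df2, how='left', on=['AUTHOR_NAME', 'title'],
                  validate='many_to_one')

print(len(merged) == len(df))
print(merged['CATEGORY'].notna().sum())
```

With how='left' every row of df is kept exactly once as long as the keys in df2 are unique, so the row count of the result equals that of df.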

Option 2: join

    cols = ['AUTHOR_NAME', 'title']
    df.join(df2.set_index(cols), on=cols)

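Both options yield the same result: all 15 rows of df, with CATEGORY filled in where both AUTHOR_NAME and title match a row of df2, and NaN elsewhere.

       AUTHOR_NAME title CATEGORY
    0            A     z        1
    1            A     y      NaN
    2            A     x      NaN
    3            B     w      NaN
    4            B     v        2
    5            C     u      NaN
    6            C     t        2
    7            C     s      NaN
    8            C     r        3
    9            D     q      NaN
    10           E     p        3
    11           E     o      NaN
    12           F     n      NaN
    13           G     m        4
    14           G     l        4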
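In the question's layout one of the two keys is the index rather than a column. One way to apply the same left merge there is to move the index into a column, merge on both keys, and restore the index afterwards. A sketch under stated assumptions: the frames are rebuilt by hand from the question's example data, each (Index, AUTHOR_NAME) pair occurs at most once in df2, and the names keys and result are illustrative:

```python
import pandas as pd

# Question's example data rebuilt by hand (assumption: the index is named "Index").
df2 = pd.DataFrame(
    {'AUTHOR_NAME': ['author1', 'author1', 'author1',
                     'author2', 'author2', 'author4'],
     'CATEGORY': ['main', 'main', 'main', 'sub', 'sub', 'sub']},
    index=pd.Index(['Pub1', 'Pub2', 'Pub3', 'Pub1', 'Pub3', 'Pub2'],
                   name='Index'),
)
df = pd.DataFrame(
    {'AUTHOR_NAME': ['author1', 'author1', 'author2', 'author3', 'author4']},
    index=pd.Index(['Pub1', 'Pub2', 'Pub1', 'Pub1', 'Pub2'], name='Index'),
)

keys = ['Index', 'AUTHOR_NAME']

# reset_index() turns "Index" into an ordinary column so both keys can be
# passed to `on`; set_index() puts it back after the merge.
result = (
    df.reset_index()
      .merge(df2.reset_index(), how='left', on=keys)
      .set_index('Index')
)

print(result)
```

Matching on both keys is what keeps the row count stable here: "Pub1" alone appears twice in df2, so a match on the index only would pair each Pub1 row of df with both of them.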


answered Oct 04 '22 21:10

piRSquared
