Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas populate new dataframe column based on matching columns in another dataframe

I have a df which contains my main data which has one million rows. My main data also has 30 columns. Now I want to add another column to my df called category. The category is a column in df2 which contains around 700 rows and two other columns that will match with two columns in df.

I begin with setting an index in df2 and df that will match between the frames, however some of the index in df2 doesn't exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some of the AUTHOR_NAME in df doesn't exist in df2 and vice versa.

The instruction I want is: when index in df matches with index in df2 and title in df matches with title in df2, add category to df, else add NaN in category.

Example data:

df2            AUTHOR_NAME              CATEGORY Index        Pub1        author1                 main Pub2        author1                 main Pub3        author1                 main Pub1        author2                 sub Pub3        author2                 sub Pub2        author4                 sub   df             AUTHOR_NAME     ...n amount of other columns         Index        Pub1        author1                  Pub2        author1      Pub1        author2  Pub1        author3 Pub2        author4   expected_result             AUTHOR_NAME             CATEGORY   ...n amount of other columns Index Pub1        author1                 main Pub2        author1                 main Pub1        author2                 sub Pub1        author3                 NaN Pub2        author4                 sub 

If I use df2.merge(df,left_index=True,right_index=True,how='left', on=['AUTHOR_NAME']) my df becomes three times bigger than it is supposed to be.

So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on if certain conditions are met.

def calculate_category(df2, d):     category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]     return str(category_row['CATEGORY'].iat[0])  df.apply(lambda d: calculate_category(df2, d), axis=1) 

However, this throws me an error:

IndexError: ('index out of bounds', u'occurred at index 7614') 
like image 847
user3471881 Avatar asked Oct 02 '16 11:10

user3471881


People also ask

How do I get a column value of a pandas DataFrame based on another column?

You can extract a column of pandas DataFrame based on another value by using the DataFrame. query() method. The query() is used to query the columns of a DataFrame with a boolean expression.

How do you create a new column in pandas by subtracting two columns?

We can create a function specifically for subtracting the columns, by taking column data as arguments and then using the apply method to apply it to all the data points throughout the column.


1 Answers

Consider the following dataframes df and df2

df = pd.DataFrame(dict(         AUTHOR_NAME=list('AAABBCCCCDEEFGG'),         title=      list('zyxwvutsrqponml')     ))  df2 = pd.DataFrame(dict(         AUTHOR_NAME=list('AABCCEGG'),         title      =list('zwvtrpml'),         CATEGORY   =list('11223344')     )) 

option 1
merge

df.merge(df2, how='left') 

option 2
join

cols = ['AUTHOR_NAME', 'title'] df.join(df2.set_index(cols), on=cols) 

both options yield

enter image description here

like image 157
piRSquared Avatar answered Oct 04 '22 21:10

piRSquared