I'm fairly new to Python and data science. I'm working on the Kaggle Outbrain competition, and all datasets referenced in my code can be found at https://www.kaggle.com/c/outbrain-click-prediction/data.
On to the problem: I have a dataframe with columns ['document_id', 'category_id', 'confidence_level']. I would like to add a fourth column, 'max_cat', that returns the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.
    import pandas as pd
    import numpy

    main_folder = r'...filepath\data_location' + '\\'

    docs_meta = pd.read_csv(main_folder + 'documents_meta.csv\documents_meta.csv',nrows=1000)
    docs_categories = pd.read_csv(main_folder + 'documents_categories.csv\documents_categories.csv',nrows=1000)
    docs_entities = pd.read_csv(main_folder + 'documents_entities.csv\documents_entities.csv',nrows=1000)
    docs_topics = pd.read_csv(main_folder + 'documents_topics.csv\documents_topics.csv',nrows=1000)

    def find_max(row,the_df,groupby_col,value_col,target_col):
        return the_df[the_df[groupby_col]==row[groupby_col]].loc[the_df[value_col].idxmax()][target_col]

    test = docs_categories.copy()
    test['max_cat'] = test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'))
This gives me the error: KeyError: ('document_id', 'occurred at index document_id')
Can anyone help explain either why this error occurred, or how to achieve my goal in a more efficient manner?
Thanks!
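As EdChum answered in the comments, the issue is that apply works column-wise by default (axis=0; see the docs). Each call therefore receives a whole column as a Series indexed by the row labels, not by the column names, so row['document_id'] finds no such key and raises the KeyError.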
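To see what apply actually hands to the function, here is a minimal sketch on a made-up two-row frame (the data is hypothetical, not the Outbrain files). It records the index of the Series passed in for each axis setting and reproduces the failing lookup:

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the axis behaviour
df = pd.DataFrame({"document_id": [1, 2], "category_id": [10, 20]})

# Default axis=0: the function receives one column at a time,
# so the Series index holds the row labels (0, 1)
labels_default = df.apply(lambda s: ", ".join(map(str, s.index)))

# axis=1: the function receives one row at a time,
# so the Series index holds the column names
labels_rows = df.apply(lambda s: ", ".join(map(str, s.index)), axis=1)

# Looking up a column name on a column Series fails, as in the question
error = None
try:
    df.apply(lambda row: row["document_id"])
except KeyError as exc:
    error = exc

print(labels_default)
print(labels_rows)
print(repr(error))
```

The exact wording of the KeyError message may differ between pandas versions, but the cause is the same.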
To apply the function to each row instead, pass axis=1:

    test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'), axis=1)
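On the second part of the question (a more efficient way): a row-wise apply re-filters the whole frame for every row, which gets slow on large data. Also, as written, find_max appears to take idxmax over the entire frame rather than over the rows matching the document, so it may raise another KeyError for any document that does not contain the overall maximum. One possible alternative is to let groupby find the best row per document and then map the result back. This is only a sketch: the small frame below is made-up stand-in data for documents_categories.csv, and best_idx and best_cat are names I introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for documents_categories.csv
docs_categories = pd.DataFrame({
    "document_id": [1, 1, 2, 2, 2],
    "category_id": [10, 11, 20, 21, 22],
    "confidence_level": [0.2, 0.9, 0.5, 0.1, 0.7],
})

test = docs_categories.copy()

# For each document_id, the index label of its highest-confidence row
best_idx = test.groupby("document_id")["confidence_level"].idxmax()

# Lookup table: document_id -> category_id of that highest-confidence row
best_cat = test.loc[best_idx].set_index("document_id")["category_id"]

# Broadcast the winning category back onto every row of the document
test["max_cat"] = test["document_id"].map(best_cat)

print(test)
```

If two categories tie for the highest confidence within a document, idxmax keeps the first one it encounters, so you may want to sort beforehand if the tie-break matters.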