Consistently getting a keyerror when I try to merge two data frames. The code:
c = pd.merge(a, b, on='video_id', how='left')
Based on internet research I double checked the dtype and coerced both to int:
a = pd.read_csv(filename, index_col=False, dtype={'video_id': np.int64}, low_memory=False)
b = pd.read_csv(videoinfo, index_col=False, dtype={'video_id': np.int64})
Renamed the columns (to make sure they match):
a.columns.values[2] = "video_id"
b.columns.values[0] = "video_id"
Coerced to df:
c = pd.merge(pd.DataFrame(a), pd.DataFrame(b), on='video_id', how='left')
Out of ideas as to why I'm still getting the keyerror. And it's always "KeyError: 'video_id'"
How to Fix the KeyError? We can simply fix the error by correcting the spelling of the key. If we are not sure about the spelling we can simply print the list of all column names and crosscheck.
Pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join operation. The joining is performed on columns or indexes. If the joining is done on columns, indexes are ignored. This function returns a new DataFrame and the source DataFrame objects are unchanged.
Combine Two Series Using pandas. merge() can be used for all database join operations between DataFrame or named series objects. You have to pass an extra parameter “name” to the series in this case. For instance, pd. merge(S1, S2, right_index=True, left_index=True) .
Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.
You want to be careful not to use df.columns.values
to rename columns. Doing so screws with the indexing on your column names.
If you know which column names you're replacing, you can try something like this:
a.rename(columns={'old_col_name':'video_id'}, inplace = True)
b.rename(columns={'old_col_name':'video_id'}, inplace = True)
If you don't know the column names ahead of time, you can try:
col_names_a = a.columns
col_names_a[index] = 'video_id'
a.columns = col_names_a
Keep in mind, you actually don't need to use the same column names on both dataframes. Pandas allows you to specify the individual names in each dataframe
pd.merge(a, b, left_on = 'a_col', right_on = 'b_col', how = 'left')
There was a leading space in one of the dfs in the column name, 'video_id '
instead of 'video_id'
. Not sure why the initial rename didn't fix that but it's fixed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With