I have three datasets (final_NN, ppt_code, herd_id), and I wish to add a new column called MapValue in final_NN dataframe, and the value to be added can be retrieved from the other two dataframes, the rule is in the bottom after codes.
import pandas as pd
final_NN = pd.DataFrame({
"number": [123, 456, "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown"],
"ID": ["", "", "", "", "", "", "", "", 799, 813],
"code": ["", "", "AA", "AA", "BB", "BB", "BB", "CC", "", ""]
})
ppt_code = pd.DataFrame({
"code": ["AA", "AA", "BB", "BB", "CC"],
"number": [11, 11, 22, 22, 33]
})
herd_id = pd.DataFrame({
"ID": [799, 813],
"number": [678, 789]
})
new_column = pd.Series([])
for i in range(len(final_NN)):
if final_NN["number"][i] != "" and final_NN["number"][i] != "Unknown":
new_column[i] = final_NN['number'][i]
elif final_NN["code"][i] != "":
for p in range(len(ppt_code)):
if ppt_code["code"][p] == final_NN["code"][i]:
new_column[i] = ppt_code["number"][p]
elif final_NN["ID"][i] != "":
for h in range(len(herd_id)):
if herd_id["ID"][h] == final_NN["ID"][i]:
new_column[i] = herd_id["number"][h]
else:
new_column[i] = ""
final_NN.insert(3, "MapValue", new_column)
print(final_NN)
final_NN:
number ID code
0 123
1 456
2 Unknown AA
3 Unknown AA
4 Unknown BB
5 Unknown BB
6 Unknown BB
7 Unknown CC
8 Unknown 799
9 Unknown 813
ppt_code:
code number
0 AA 11
1 AA 11
2 BB 22
3 BB 22
4 CC 33
herd_id:
ID number
0 799 678
1 813 789
Expected output:
number ID code MapValue
0 123 123
1 456 456
2 Unknown AA 11
3 Unknown AA 11
4 Unknown BB 22
5 Unknown BB 22
6 Unknown BB 22
7 Unknown CC 33
8 Unknown 799 678
9 Unknown 813 789
The rules is:
number in final_NN is not Unknown, MapValue = number in final_NN;number in final_NN is Unknown but code in final_NN is not Null, then search the ppt_code dataframe, and use the code and its corresponding "number" to map and fill in the "MapValue" in final_NN;number and code in final_NN are Unknown and null respectively, but ID in final_NN is not Null, then search herd_id dataframe, and use the ID and its corresponding number to fill in the MapValue in the first dataframe. I applied a loop through the dataframe which is a slow way to achieve this, as above. But I understand there could be a faster way to do this. Just wondering would anyone help me to have a fast and easier way to achieve the same result?First create a mapping series from the ppt_code and herd_id dataframes, then use Series.replace to create a new column MapNumber by replacing the Unknown values in number column with np.NaN, then use two consecutive Series.fillna along with Series.map to fill the missing values in MapNumber column according to the rules:
ppt_map = ppt_code.drop_duplicates(subset=['code']).set_index('code')['number']
hrd_map = herd_id.drop_duplicates(subset=['ID']).set_index('ID')['number']
final_NN['MapNumber'] = final_NN['number'].replace({'Unknown': np.nan})
final_NN['MapNumber'] = (
final_NN['MapNumber']
.fillna(final_NN['code'].map(ppt_map))
.fillna(final_NN['ID'].map(hrd_map))
)
Result:
# print(final_NN)
number ID code MapNumber
0 123 123.0
1 456 456.0
2 Unknown AA 11.0
3 Unknown AA 11.0
4 Unknown BB 22.0
5 Unknown BB 22.0
6 Unknown BB 22.0
7 Unknown CC 33.0
8 Unknown 799 678.0
9 Unknown 813 789.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With