A better way to map data in multiple datasets, with multiple data mapping rules

Question

I have three datasets (final_NN, ppt_code, herd_id), and I wish to add a new column called MapValue in final_NN dataframe, and the value to be added can be retrieved from the other two dataframes, the rule is in the bottom after codes.

import pandas as pd

final_NN = pd.DataFrame({
    "number": [123, 456, "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown"],
    "ID": ["", "", "", "", "", "", "", "", 799, 813],
    "code": ["", "", "AA", "AA", "BB", "BB", "BB", "CC", "", ""]
})

ppt_code = pd.DataFrame({
    "code": ["AA", "AA", "BB", "BB", "CC"],
    "number": [11, 11, 22, 22, 33]
})

herd_id = pd.DataFrame({
    "ID": [799, 813],
    "number": [678, 789]
})

new_column = pd.Series([])
for i in range(len(final_NN)):
    if final_NN["number"][i] != "" and final_NN["number"][i] != "Unknown":
        new_column[i] = final_NN['number'][i]

    elif final_NN["code"][i] != "":
        for p in range(len(ppt_code)):
            if ppt_code["code"][p] == final_NN["code"][i]:
                new_column[i] = ppt_code["number"][p]

    elif final_NN["ID"][i] != "":
        for h in range(len(herd_id)):
            if herd_id["ID"][h] == final_NN["ID"][i]:
                new_column[i] = herd_id["number"][h]

    else:
        new_column[i] = ""

final_NN.insert(3, "MapValue", new_column)
print(final_NN)

final_NN:

    number   ID code
0      123          
1      456          
2  Unknown        AA
3  Unknown        AA
4  Unknown        BB
5  Unknown        BB
6  Unknown        BB
7  Unknown        CC
8  Unknown  799     
9  Unknown  813

ppt_code:

  code  number
0   AA      11
1   AA      11
2   BB      22
3   BB      22
4   CC      33

herd_id:

    ID  number
0  799     678
1  813     789

Expected output:

    number   ID code   MapValue
0      123                  123
1      456                  456
2  Unknown        AA         11
3  Unknown        AA         11
4  Unknown        BB         22
5  Unknown        BB         22
6  Unknown        BB         22
7  Unknown        CC         33
8  Unknown  799             678
9  Unknown  813             789

The rules is:

if number in final_NN is not Unknown, MapValue = number in final_NN;
if number in final_NN is Unknown but code in final_NN is not Null, then search the ppt_code dataframe, and use the code and its corresponding "number" to map and fill in the "MapValue" in final_NN;
if both number and code in final_NN are Unknown and null respectively, but ID in final_NN is not Null, then search herd_id dataframe, and use the ID and its corresponding number to fill in the MapValue in the first dataframe. I applied a loop through the dataframe which is a slow way to achieve this, as above. But I understand there could be a faster way to do this. Just wondering would anyone help me to have a fast and easier way to achieve the same result?

Shubham Sharma · Accepted Answer

First create a mapping series from the ppt_code and herd_id dataframes, then use Series.replace to create a new column MapNumber by replacing the Unknown values in number column with np.NaN, then use two consecutive Series.fillna along with Series.map to fill the missing values in MapNumber column according to the rules:

ppt_map = ppt_code.drop_duplicates(subset=['code']).set_index('code')['number']
hrd_map = herd_id.drop_duplicates(subset=['ID']).set_index('ID')['number']

final_NN['MapNumber'] = final_NN['number'].replace({'Unknown': np.nan})
final_NN['MapNumber'] = (
    final_NN['MapNumber']
    .fillna(final_NN['code'].map(ppt_map))
    .fillna(final_NN['ID'].map(hrd_map))
)

Result:

# print(final_NN)

    number   ID code  MapNumber
0      123                123.0
1      456                456.0
2  Unknown        AA       11.0
3  Unknown        AA       11.0
4  Unknown        BB       22.0
5  Unknown        BB       22.0
6  Unknown        BB       22.0
7  Unknown        CC       33.0
8  Unknown  799           678.0
9  Unknown  813           789.0

A better way to map data in multiple datasets, with multiple data mapping rules

Tags:

XaviorL

1 Answers

Shubham Sharma

Recent Activity

Donate For Us

A better way to map data in multiple datasets, with multiple data mapping rules

Tags:

XaviorL

1 Answers

Shubham Sharma

Related questions

Recent Activity

Donate For Us