Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A better way to map data in multiple datasets, with multiple data mapping rules

Tags:

I have three datasets (final_NN, ppt_code, herd_id), and I wish to add a new column called MapValue in final_NN dataframe, and the value to be added can be retrieved from the other two dataframes, the rule is in the bottom after codes.

import pandas as pd

final_NN = pd.DataFrame({
    "number": [123, 456, "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", "Unknown"],
    "ID": ["", "", "", "", "", "", "", "", 799, 813],
    "code": ["", "", "AA", "AA", "BB", "BB", "BB", "CC", "", ""]
})

ppt_code = pd.DataFrame({
    "code": ["AA", "AA", "BB", "BB", "CC"],
    "number": [11, 11, 22, 22, 33]
})

herd_id = pd.DataFrame({
    "ID": [799, 813],
    "number": [678, 789]
})

new_column = pd.Series([])
for i in range(len(final_NN)):
    if final_NN["number"][i] != "" and final_NN["number"][i] != "Unknown":
        new_column[i] = final_NN['number'][i]

    elif final_NN["code"][i] != "":
        for p in range(len(ppt_code)):
            if ppt_code["code"][p] == final_NN["code"][i]:
                new_column[i] = ppt_code["number"][p]

    elif final_NN["ID"][i] != "":
        for h in range(len(herd_id)):
            if herd_id["ID"][h] == final_NN["ID"][i]:
                new_column[i] = herd_id["number"][h]

    else:
        new_column[i] = ""

final_NN.insert(3, "MapValue", new_column)
print(final_NN)

final_NN:

    number   ID code
0      123          
1      456          
2  Unknown        AA
3  Unknown        AA
4  Unknown        BB
5  Unknown        BB
6  Unknown        BB
7  Unknown        CC
8  Unknown  799     
9  Unknown  813 

ppt_code:

  code  number
0   AA      11
1   AA      11
2   BB      22
3   BB      22
4   CC      33

herd_id:

    ID  number
0  799     678
1  813     789

Expected output:

    number   ID code   MapValue
0      123                  123
1      456                  456
2  Unknown        AA         11
3  Unknown        AA         11
4  Unknown        BB         22
5  Unknown        BB         22
6  Unknown        BB         22
7  Unknown        CC         33
8  Unknown  799             678
9  Unknown  813             789

The rules is:

  1. if number in final_NN is not Unknown, MapValue = number in final_NN;
  2. if number in final_NN is Unknown but code in final_NN is not Null, then search the ppt_code dataframe, and use the code and its corresponding "number" to map and fill in the "MapValue" in final_NN;
  3. if both number and code in final_NN are Unknown and null respectively, but ID in final_NN is not Null, then search herd_id dataframe, and use the ID and its corresponding number to fill in the MapValue in the first dataframe. I applied a loop through the dataframe which is a slow way to achieve this, as above. But I understand there could be a faster way to do this. Just wondering would anyone help me to have a fast and easier way to achieve the same result?
like image 281
XaviorL Avatar asked Jun 23 '20 04:06

XaviorL


1 Answers

First create a mapping series from the ppt_code and herd_id dataframes, then use Series.replace to create a new column MapNumber by replacing the Unknown values in number column with np.NaN, then use two consecutive Series.fillna along with Series.map to fill the missing values in MapNumber column according to the rules:

ppt_map = ppt_code.drop_duplicates(subset=['code']).set_index('code')['number']
hrd_map = herd_id.drop_duplicates(subset=['ID']).set_index('ID')['number']

final_NN['MapNumber'] = final_NN['number'].replace({'Unknown': np.nan})
final_NN['MapNumber'] = (
    final_NN['MapNumber']
    .fillna(final_NN['code'].map(ppt_map))
    .fillna(final_NN['ID'].map(hrd_map))
)

Result:

# print(final_NN)

    number   ID code  MapNumber
0      123                123.0
1      456                456.0
2  Unknown        AA       11.0
3  Unknown        AA       11.0
4  Unknown        BB       22.0
5  Unknown        BB       22.0
6  Unknown        BB       22.0
7  Unknown        CC       33.0
8  Unknown  799           678.0
9  Unknown  813           789.0
like image 197
Shubham Sharma Avatar answered Oct 02 '22 16:10

Shubham Sharma