I searched a lot for an answer, the closest question was Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python, but the answer to this person's particular problem was a simple merge, which doesn't answer the question in a general way.
I have two large dataframes, df1 (typically about 10 million rows), and df2 (about 130 million rows). I need to update values in three columns of df1 with values from three columns of df2, based on two df1 columns matching two df2 columns. It is imperative that the order of df1 remains unchanged, and that only rows with matching values get updated.
This is how the dataframes look like:
df1
chr snp x pos a1 a2
1 1-10020 0 10020 G A
1 1-10056 0 10056 C G
1 1-10108 0 10108 C G
1 1-10109 0 10109 C G
1 1-10139 0 10139 C T
Note that it's not always the case that the values of "snp" is chr-pos, it can take many other values with no link to any of the columns (like rs1234, indel-6032 etc)
df2
ID CHR STOP OCHR OSTOP
rs376643643 1 10040 1 10020
rs373328635 1 10066 1 10056
rs62651026 1 10208 1 10108
rs376007522 1 10209 1 10109
rs368469931 3 30247 1 10139
I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]
so in this case, after update, df1 would look like:
chr snp x pos a1 a2
1 rs376643643 0 10040 G A
1 rs373328635 0 10066 C G
1 rs62651026 0 10208 C G
1 rs376007522 0 10209 C G
3 rs368469931 0 30247 C T
I have used merge as a workaround:
df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
left_index=False, right_index=False, sort=False)
and then
df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values
and then remove the extra columns.
Yes, it works, but what would be a way to do that directly by comparing the values from both dataframes, I just don't know how to formulate it, and I couldn't find an answer anywhere; I guess it could be useful to get a general answer on this.
I tried that but it doesn't work:
df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["chr", "pos"]],['ID', ''CHR', 'STOP']].values
Thanks,
Stephane
You can use the update
function (requires setting the matching criteria to index). I've modified your sample data to allow some mismatch.
# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)
chr snp x pos a1 a2
0 1 1-10020 0 10010 G A
1 1 1-10056 0 10056 C G
2 1 1-10108 0 10108 C G
3 1 1-10109 0 10109 C G
4 1 1-10139 0 10139 C T
print(df2)
ID CHR STOP OCHR OSTOP
0 rs376643643 1 10040 1 10020
1 rs373328635 1 10066 1 10056
2 rs62651026 1 10208 1 10108
3 rs376007522 1 10209 1 10109
4 rs368469931 3 30247 1 10139
# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)
chr snp x pos a1 a2
0 1 1-10020 0 10010 G A
1 1 rs373328635 0 10056 C G
2 1 rs62651026 0 10108 C G
3 1 rs376007522 0 10109 C G
4 1 rs368469931 0 10139 C T
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With