Today I was trying to go a little deeper in the merge()
function of pandas, and I found the option validate
, which, as reported in the documentation, can be:
validate : string, default None
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: check if merge keys are unique in left dataset. “many_to_one” or “m:1”: check if merge keys are unique in right dataset. "many_to_many” or “m:m”: allowed, but does not result in checks.
I have looked around to find a working example on where and how to use this function, but I couldn't find any. Moreover when I tried to apply it to a group of DataFrame
's I was merging, it didn't seem to change the output.
Can anyone give me a working example, to make me understand it better?
Thanks in advance,
Mattia
The new valdate
param will raise a MergeError
if the validation fails, example:
df1 = pd.DataFrame({'a':list('aabc'),'b':np.random.randn(4)})
df2 = pd.DataFrame({'a':list('aabc'),'b':np.random.randn(4)})
print(df1)
print(df2)
a b
0 a -2.557152
1 a -0.145969
2 b -1.629560
3 c -0.233517
a b
0 a -0.352038
1 a 0.490438
2 b 0.319452
3 c -0.599481
Now if we merge on column 'a'
without validate
:
In[39]:
df1.merge(df2, on='a')
Out[39]:
a b_x b_y
0 a -2.557152 -0.352038
1 a -2.557152 0.490438
2 a -0.145969 -0.352038
3 a -0.145969 0.490438
4 b -1.629560 0.319452
5 c -0.233517 -0.599481
This works but we get more rows for 'a' as column 'b' is different, now we pass validate='1:1'
, we get an error:
MergeError: Merge keys are not unique in either
left or right dataset; not a one-to-one merge
if we pass validate='1:m'
we get a different error:
MergeError: Merge keys are not unique in left
dataset;not a one-to-many merge
Again this fails the validation, if we pass 'm:m'
:
In[42]:
df1.merge(df2, on='a',validate='m:m')
Out[42]:
a b_x b_y
0 a -2.557152 -0.352038
1 a -2.557152 0.490438
2 a -0.145969 -0.352038
3 a -0.145969 0.490438
4 b -1.629560 0.319452
5 c -0.233517 -0.599481
no error occurs and we get the same merged df if we had not passed the validate
param
The api docs don't give an example but the what's new section does, the original github enhancement also gives further background information
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With