Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Validate in merge function pandas

Today I was trying to go a little deeper in the merge() function of pandas, and I found the option validate, which, as reported in the documentation, can be:

validate : string, default None

If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: check if merge keys are unique in left dataset. “many_to_one” or “m:1”: check if merge keys are unique in right dataset. "many_to_many” or “m:m”: allowed, but does not result in checks.

I have looked around to find a working example on where and how to use this function, but I couldn't find any. Moreover when I tried to apply it to a group of DataFrame's I was merging, it didn't seem to change the output. Can anyone give me a working example, to make me understand it better?

Thanks in advance,

Mattia

like image 818
iraciv94 Avatar asked Jun 27 '18 08:06

iraciv94


1 Answers

The new valdate param will raise a MergeError if the validation fails, example:

df1 = pd.DataFrame({'a':list('aabc'),'b':np.random.randn(4)})
df2 = pd.DataFrame({'a':list('aabc'),'b':np.random.randn(4)})

print(df1)
print(df2)

   a         b
0  a -2.557152
1  a -0.145969
2  b -1.629560
3  c -0.233517

   a         b
0  a -0.352038
1  a  0.490438
2  b  0.319452
3  c -0.599481

Now if we merge on column 'a' without validate:

In[39]:
df1.merge(df2, on='a')

Out[39]: 
   a       b_x       b_y
0  a -2.557152 -0.352038
1  a -2.557152  0.490438
2  a -0.145969 -0.352038
3  a -0.145969  0.490438
4  b -1.629560  0.319452
5  c -0.233517 -0.599481

This works but we get more rows for 'a' as column 'b' is different, now we pass validate='1:1', we get an error:

MergeError: Merge keys are not unique in either
left or right dataset; not a one-to-one merge

if we pass validate='1:m' we get a different error:

MergeError: Merge keys are not unique in left
dataset;not a one-to-many merge

Again this fails the validation, if we pass 'm:m':

In[42]:
df1.merge(df2, on='a',validate='m:m')

Out[42]: 
   a       b_x       b_y
0  a -2.557152 -0.352038
1  a -2.557152  0.490438
2  a -0.145969 -0.352038
3  a -0.145969  0.490438
4  b -1.629560  0.319452
5  c -0.233517 -0.599481

no error occurs and we get the same merged df if we had not passed the validate param

The api docs don't give an example but the what's new section does, the original github enhancement also gives further background information

like image 159
EdChum Avatar answered Sep 18 '22 17:09

EdChum