I have two data frames that I am trying to merge.
Dataframe A:
col1 col2 sub grade
0 1 34.32 x a
1 1 34.32 x b
2 1 34.33 y c
3 2 10.14 z b
4 3 33.01 z a
Dataframe B:
col1 col2 group ID
0 1 34.32 t z
1 1 54.32 s w
2 1 34.33 r z
3 2 10.14 q z
4 3 33.01 q e
I want to merge on col1 and col2. I've been pd.merge with the following syntax:
pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
However, I think I am running into issues joining on the float values of col2 since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I reference the index of a particular value of col2 in either dataframe, the value has many more decimal places than what is displayed in the dataframe.
I would like the result to be:
col1 col2 sub grade group ID
0 1 34.32 x a t z
1 1 34.32 x b s w
2 1 54.32 s w NaN NaN
3 1 34.33 y c r z
4 2 10.14 z b q z
5 3 33.01 z a q e
concat() to Merge Two DataFrames by Index. You can concatenate two DataFrames by using pandas. concat() method by setting axis=1 , and by default, pd. concat is a row-wise outer join.
merge() can be used for all database join operations between DataFrame or named series objects. You have to pass an extra parameter “name” to the series in this case. For instance, pd. merge(S1, S2, right_index=True, left_index=True) .
You can use a little hack - multiple float columns by some constant like 100
, 1000
..., convert column to int
, merge
and last divide by constant:
N = 100
#thank you koalo for comment
A.col2 = np.round(A.col2*N).astype(int)
B.col2 = np.round(B.col2*N).astype(int)
df = pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
df.col2 = df.col2 / N
print (df)
col1 col2 sub grade group ID
0 1 34.32 x a t z
1 1 34.32 x b t z
2 1 34.33 y c r z
3 2 10.14 z b q z
4 3 33.01 z a q e
5 1 54.32 NaN NaN s w
I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.
In this case, I used scipy.spatial.distance.cosine to get the cosine similarity between rows.
from scipy import distance
threshold = 0.99999
similarity = 1 - spatial.distance.cosine(row1, row2)
if similarity >= threshold:
# it's a match
else:
# loop and check another row pair
This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many of rows, it works well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With