Merge pandas DataFrame on column of float values

Tags: python, merge, pandas

I have two data frames that I am trying to merge.

Dataframe A:

    col1    col2    sub    grade
0   1       34.32   x       a 
1   1       34.32   x       b
2   1       34.33   y       c
3   2       10.14   z       b
4   3       33.01   z       a

Dataframe B:

    col1    col2    group   ID
0   1       34.32   t       z 
1   1       54.32   s       w
2   1       34.33   r       z
3   2       10.14   q       z
4   3       33.01   q       e

I want to merge on col1 and col2. I've been using pd.merge with the following syntax:

pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])

However, I think I am running into issues joining on the float values of col2 since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I reference the index of a particular value of col2 in either dataframe, the value has many more decimal places than what is displayed in the dataframe.
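The symptom described above is typical of binary floating point: two values that print the same can differ in their last stored digits, and an exact-equality join then treats them as different keys. A minimal sketch of the effect, using arbitrary illustrative values (not the data from the question):

```python
import numpy as np

a = 0.1 + 0.2   # stored as 0.30000000000000004
b = 0.3

# Exact comparison fails even though both values display as 0.3
# when rounded to a few decimals.
exact_match = (a == b)

# np.isclose compares within a tolerance (defaults: rtol=1e-05, atol=1e-08).
close_match = bool(np.isclose(a, b))

print(exact_match, close_match)  # False True
```

pd.merge compares keys exactly, so it has no tolerance argument that would play the role of np.isclose here.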

I would like the result to be:

    col1   col2   sub   grade   group    ID
0   1      34.32  x     a       t        z
1   1      34.32  x     b       s        w
2   1      54.32  s     w       NaN      NaN
3   1      34.33  y     c       r        z
4   2      10.14  z     b       q        z
5   3      33.01  z     a       q        e

772

asked Dec 14 '16 05:12

Megan

2 Answers

You can use a little hack: multiply the float column by a constant such as 100 or 1000 (depending on how many decimals you need), round and convert the column to int, merge, and finally divide by the constant:

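import numpy as np
import pandas as pd

N = 100  # scale factor: 100 keeps two decimal places
# round before casting so that e.g. 3431.9999 becomes 3432 instead of
# being truncated to 3431 (thank you koalo for the comment)
A['col2'] = np.round(A['col2'] * N).astype(int)
B['col2'] = np.round(B['col2'] * N).astype(int)
df = pd.merge(A, B, how='outer', on=['col1', 'col2'])
# restore the original scale after the merge
df['col2'] = df['col2'] / N
print(df)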
   col1   col2  sub grade group ID
0     1  34.32    x     a     t  z
1     1  34.32    x     b     t  z
2     1  34.33    y     c     r  z
3     2  10.14    z     b     q  z
4     3  33.01    z     a     q  e
5     1  54.32  NaN   NaN     s  w
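For anyone who wants to try this end to end, here is a self-contained sketch of the same scaling idea on the frames from the question. The helper name `merge_on_rounded`, its parameters, the temporary column `_scaled_key`, and the `1e-9` noise added to B are all assumptions made for illustration; they are not part of pandas or of the original answer. Unlike the snippet above, it works on copies, so A and B are left unchanged.

```python
import numpy as np
import pandas as pd

def merge_on_rounded(left, right, float_col, other_keys, decimals=2, how="outer"):
    """Hypothetical helper: merge on float_col after scaling it to integers."""
    scale = 10 ** decimals
    key = "_scaled_key"  # temporary column, assumed not to exist in either frame
    # assign() returns new frames, so the caller's data is not modified.
    # Assumes float_col contains no NaN (the int cast would fail otherwise).
    left = left.assign(**{key: np.round(left[float_col] * scale).astype("int64")})
    right = right.assign(**{key: np.round(right[float_col] * scale).astype("int64")})
    merged = pd.merge(
        left.drop(columns=float_col),
        right.drop(columns=float_col),
        how=how,
        on=list(other_keys) + [key],
    )
    # Convert the integer key back to the original scale.
    merged[float_col] = merged.pop(key) / scale
    return merged

A = pd.DataFrame({
    "col1": [1, 1, 1, 2, 3],
    "col2": [34.32, 34.32, 34.33, 10.14, 33.01],
    "sub": list("xxyzz"),
    "grade": list("abcba"),
})
B = pd.DataFrame({
    "col1": [1, 1, 1, 2, 3],
    "col2": [34.32, 54.32, 34.33, 10.14, 33.01],
    "group": list("tsrqq"),
    "ID": list("zwzze"),
})
B["col2"] = B["col2"] + 1e-9  # simulated float noise, invisible when printed

exact = pd.merge(A, B, how="outer", on=["col1", "col2"])  # nothing matches
fixed = merge_on_rounded(A, B, "col2", ["col1"])          # keys match again
print(len(exact), len(fixed))
```

With the noise in place the exact merge pairs nothing and returns every row of both frames separately, while the scaled merge pairs the rows again and leaves only the 54.32 row unmatched. Note that `col2` ends up as the last column in this sketch.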

184

answered Oct 14 '22 16:10

jezrael

I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.

In this case, I used scipy.spatial.distance.cosine to get the cosine similarity between rows.

from scipy.spatial import distance

threshold = 0.99999
# distance.cosine returns the cosine distance, so 1 - distance is the similarity
similarity = 1 - distance.cosine(row1, row2)

if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair

This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many rows, it works well.
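The loop over row pairs can also be done in one step. Below is a rough sketch using plain NumPy instead of scipy.spatial.distance.cosine; the helper name `match_rows_by_cosine` and the sample arrays are made up for illustration, and the code assumes no row is all zeros.

```python
import numpy as np

def match_rows_by_cosine(X, Y, threshold=0.99999):
    """Hypothetical helper: return (i, j) pairs where X[i] and Y[j]
    have cosine similarity >= threshold."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Scale every row to unit length (assumes no all-zero rows).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    # Dot products of unit vectors are the cosine similarities.
    sim = Xn @ Yn.T
    return [(int(i), int(j)) for i, j in np.argwhere(sim >= threshold)]

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.5, 1.0]])
Y = np.array([[4.0000001, 0.5, 1.0],   # X[1] with a tiny perturbation
              [1.0, 2.0, 3.0000001],   # X[0] with a tiny perturbation
              [9.0, 1.0, 0.0]])        # unrelated row
pairs = match_rows_by_cosine(X, Y)
print(pairs)
```

One thing to keep in mind: cosine similarity ignores magnitude, so a row and a scaled copy of it (for example, every value doubled) would also count as a match.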

answered Oct 14 '22 17:10

Sesquipedalism

