Select rows from a pandas dataframe with a numpy 2D array on multiple columns

Tags: python, select, pandas, dataframe, numpy

Data

I have a dataframe that contains 5 columns:

Latitude and longitude of origin (origin_lat, origin_lng)
Latitude and longitude of destination (dest_lat, dest_lng)
A score which was computed from the other fields

I also have a matrix M (a NumPy 2D array) whose rows are pairs of origin and destination latitude/longitude. Some of these pairs exist in the dataframe, others do not.

Goal

My goal is two-fold:

Select all pairs from M that are not present in the first four columns of the dataframe, apply a function func to them (to calculate the score column), and append the results to the existing dataframe. Note: the score should not be recalculated for rows that already exist.
After adding the missing rows, select all the rows defined by the selection matrix M in a new dataframe dfs.

Example code

# STEP 1: Generate example data
import numpy as np
import pandas as pd

ctr_lat = 40.676762
ctr_lng = -73.926420
N = 12
N2 = 3

data = np.array([ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10,
                 ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10]).transpose()

# Example function - does not matter what it does
def func(x):
    return np.random.random()

# Create dataframe
geocols = ['origin_lat','origin_lng','dest_lat','dest_lng']
df = pd.DataFrame(data,columns=geocols)
df['score'] = df.apply(func,axis=1)

Which gives me a dataframe df like this:

    origin_lat  origin_lng   dest_lat   dest_lng     score
0    40.684887  -73.924921  40.758641 -73.847438  0.820080
1    40.703129  -73.885330  40.774341 -73.881671  0.104320
2    40.761998  -73.898955  40.767681 -73.865001  0.564296
3    40.736863  -73.859832  40.681693 -73.907879  0.605974
4    40.761298  -73.853480  40.696195 -73.846205  0.779520
5    40.712225  -73.892623  40.722372 -73.868877  0.628447
6    40.683086  -73.846077  40.730014 -73.900831  0.320041
7    40.726003  -73.909059  40.760083 -73.829180  0.903317
8    40.748258  -73.839682  40.713100 -73.834253  0.457138
9    40.761590  -73.923624  40.746552 -73.870352  0.867617
10   40.748064  -73.913599  40.746997 -73.894851  0.836674
11   40.771164  -73.855319  40.703426 -73.829990  0.010908

I can then artificially create the selection matrix M, which contains 3 rows that exist in the dataframe and 3 rows that do not.

# STEP 2: Generate data to select
# As an example, I select 3 rows that are part of the dataframe, and 3 that are not
data2 = np.array([ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10,
                  ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10]).transpose()

M = np.concatenate((data[4:7,:],data2))

The matrix M looks like this:

array([[ 40.7612977 , -73.85348031,  40.69619549, -73.84620489],
       [ 40.71222463, -73.8926234 ,  40.72237185, -73.86887696],
       [ 40.68308567, -73.84607722,  40.73001434, -73.90083107],
       [ 40.7588412 , -73.87128079,  40.76750639, -73.91945371],
       [ 40.74686156, -73.84804047,  40.72378653, -73.92207075],
       [ 40.6922673 , -73.88275402,  40.69708748, -73.87905543]])

From here, I do not know how to find which rows of M are missing from df and add them, nor how to select all the rows of df that are in M.

Ideas

My idea was to identify the missing rows, append them to df with a NaN score, and recompute the score for the NaN rows only. However, I do not know how to select these rows efficiently without looping over every row of the matrix M.

Any suggestions? Thanks a lot for your help!

asked Sep 11 '17 17:09

nbeuchat

1 Answer

Is there any reason not to use merge?

df2 = pd.DataFrame(M, columns=geocols) 
df = df.merge(df2, how='outer')
ix = df.score.isnull()
df.loc[ix, 'score'] = df.loc[ix].apply(func, axis=1)

It does exactly what you proposed: it adds the missing rows to df with a NaN score, identifies the NaNs, and calculates the score for those rows only.
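
The snippet above covers the first goal. For the second goal (selecting the rows of M into dfs), one possible continuation is a left merge starting from df2, sketched below under a few assumptions: the example data is regenerated with a seeded generator so the block runs on its own, func is the same placeholder as in the question, the `if ix.any()` guard and the `known` variable are additions for illustration, and the coordinates in M are bit-for-bit identical to those in df (merge matches floats by exact equality).

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

def func(row):
    # Placeholder score, standing in for the real computation
    return np.random.random()

# Example data (assumption: seeded so the sketch is reproducible)
rng = np.random.default_rng(0)
data = rng.random((12, 4))
df = pd.DataFrame(data, columns=geocols)
df['score'] = df.apply(func, axis=1)

# Selection matrix: 3 rows copied from df, 3 rows that are new
M = np.concatenate((data[4:7, :], rng.random((3, 4))))
df2 = pd.DataFrame(M, columns=geocols)

# Scores of the rows M shares with df, kept only to check
# later that they were not recomputed
known = df.iloc[4:7]['score'].to_numpy()

# Goal 1: the outer merge adds the missing rows with a NaN score;
# only those rows are then scored
df = df.merge(df2, how='outer')
ix = df['score'].isnull()
if ix.any():
    df.loc[ix, 'score'] = df.loc[ix].apply(func, axis=1)

# Goal 2: a left merge from df2 keeps one output row per row of M,
# in the order of M, and pulls the score in from df
dfs = df2.merge(df, how='left', on=geocols)
```

Starting the second merge from df2 rather than df is what keeps the result aligned with M. This assumes the coordinate rows in df are unique; duplicated rows in df would produce duplicated rows in dfs.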

answered Oct 09 '22 09:10

igrinis
