Improved efficiency versus iterating over two large Pandas Dataframes

I have two HUGE Pandas dataframes with location based values, and I need to update df1['count'] with the number of records from df2 which are less than 1000m from each point in df1.

Here's an example of my data, imported into Pandas

df1 =       lat      long    valA   count
        0   123.456  986.54  1      0
        1   223.456  886.54  2      0
        2   323.456  786.54  3      0
        3   423.456  686.54  2      0
        4   523.456  586.54  1      0

df2 =       lat      long    valB
        0   123.456  986.54  1
        1   223.456  886.54  2
        2   323.456  786.54  3
        3   423.456  686.54  2
        4   523.456  586.54  1

In reality, df1 has ~10 million rows, and df2 has ~1 million

I created a working nested for loop using the Pandas DF.itertuples() method, which works fine for smaller test data sets (df1 = 1k rows and df2 = 100 rows takes about an hour to complete), but the full data set is orders of magnitude larger and would take years to complete based on my calculations. Here's my working code...

import pandas as pd
import geopy.distance as gpd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

df1.sort_values(['long', 'lat'], inplace=True) 
df2.sort_values(['long', 'lat'], inplace=True)

for irow in df1.itertuples():    
     count = 0
     indexLst = []        
     Location1 = (irow[1], irow[2])    

     for jrow in df2.itertuples():  
          Location2 = (jrow[1], jrow[2])                                      
          if gpd.distance(Location1, Location2).kilometers < 1:
             count += 1
             indexLst.append(jrow[0])    
     if count > 0:                  #only update DF if a match is found
         df1.at[irow[0],'count'] = (count)      
         df2.drop(indexLst, inplace=True)       #drop rows already counted from df2 to speed up next iteration

#save updated df1 to new csv file
outFileName = 'combined.csv'
df1.to_csv(outFileName, sep=',', index=False)

Each point in df2 needs to be counted only once, since points in df1 are evenly spaced. To that end I added a drop statement to remove rows from df2 once they have been counted, in hopes of improving iteration time. I also tried creating a merge/join statement initially, instead of the nested loops, but was unsuccessful.

At this stage, any help on improving efficiency here is greatly appreciated!

Edit: Goal is to update the 'count' column in df1 (as shown below) with the number of points from df2 which are <1km, and output to a new file.

df1 =       lat      long    valA   count
        0   123.456  986.54  1      3
        1   223.456  886.54  2      1
        2   323.456  786.54  3      9
        3   423.456  686.54  2      2
        4   523.456  586.54  1      5
asked Nov 26 '18 by dP8884



2 Answers

Having done this kind of thing frequently, I've found a few best practices:

1) Use numpy and numba as much as possible

2) Leverage parallelization as much as possible

3) Prefer vectorized code over explicit loops (the loop here exists only so numba can parallelize it)

In this particular case, I want to point out the slowdown introduced by geopy. While it is a great package and produces very accurate distances (compared to the Haversine formula), it is much slower (I haven't looked into the implementation to see why).

import numpy as np
from geopy import distance

origin = (np.random.uniform(-90,90), np.random.uniform(-180,180))
dest = (np.random.uniform(-90,90), np.random.uniform(-180,180))

%timeit distance.distance(origin, dest)

216 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

At that rate, computing 10 million x 1 million distances would take approximately 2.16 billion seconds, or about 600,000 hours. Even parallelism will only help so much.

Because you're only interested in points that are very close together, I would suggest using the Haversine distance (which is less accurate over greater distances).

from numba import jit, prange, vectorize

@vectorize
def haversine(s_lat,s_lng,e_lat,e_lng):

    # approximate radius of earth in km
    R = 6373.0

    s_lat = np.deg2rad(s_lat)
    s_lng = np.deg2rad(s_lng)
    e_lat = np.deg2rad(e_lat)
    e_lng = np.deg2rad(e_lng)

    d = np.sin((e_lat - s_lat)/2)**2 + np.cos(s_lat)*np.cos(e_lat) * np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))

%timeit haversine(origin[0], origin[1], dest[0], dest[1])

1.85 µs ± 53.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That's already a 100-fold improvement. But we can do better. You may have noticed the @vectorize decorator I added from numba. This allows the previously scalar Haversine function to become vectorized and take vectors as inputs. We'll leverage this in the next step:

@jit(nopython=True, parallel=True)
def get_nearby_count(coords, coords2, max_dist):
    '''
    Input: `coords`: First set of coordinates, lat-lngs in an n x 2 array
           `coords2`: Second set of coordinates, lat-lngs in a k x 2 array
           `max_dist`: Max distance (km) to be considered nearby
    Output: Array of length n with the count of coords2 points within max_dist of each point in coords
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        x, y = coords[i]
        # returns an array of length k
        dist = haversine(x, y, coords2[:,0], coords2[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output

Hopefully, you'll now have an array whose length matches the first set of coordinates (10 million in your case). You can then just assign this to your data frame as your count!
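As a quick sketch of that last step (the lat/long column names are taken from the example data above, and get_nearby_count is the function defined here):

coords1 = df1[['lat', 'long']].to_numpy()   # n x 2 array of df1 points
coords2 = df2[['lat', 'long']].to_numpy()   # k x 2 array of df2 points

df1['count'] = get_nearby_count(coords1, coords2, 1.0).astype(int)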

Test time 100,000 x 10,000:

n = 100_000
k = 10_000

coords1 = np.zeros((n, 2))
coords2 = np.zeros((k, 2))

coords1[:,0] = np.random.uniform(-90, 90, n)
coords1[:,1] = np.random.uniform(-180, 180, n)
coords2[:,0] = np.random.uniform(-90, 90, k)
coords2[:,1] = np.random.uniform(-180, 180, k)

%timeit get_nearby_count(coords1, coords2, 1.0)

2.45 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, that still means you'll be looking at something around 20,000+ seconds for the full 10 million x 1 million problem (10,000x as many pairs). And this was on a machine with 80 cores (utilizing ~76 of them, based on top).

That's the best I can do for now, good luck (also, first post, thanks for inspiring me to contribute!)

PS: You might also look into Dask arrays and the map_blocks() function to parallelize this (instead of relying on prange). How you partition the data may influence total execution time.
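As a rough, untested sketch of that idea with dask.array (the chunk size is an arbitrary assumption; coords1 and coords2 are the arrays built as above):

import dask.array as da

# split the large coordinate array into row blocks
d_coords = da.from_array(coords1, chunks=(500_000, 2))

# each (m, 2) block yields a length-m result, so axis 1 disappears (drop_axis=1);
# coords2 and the 1 km threshold are captured in the closure
counts = d_coords.map_blocks(
    lambda block: get_nearby_count(block, coords2, 1.0),
    drop_axis=1,
    dtype=np.float64,
).compute()

If you go this route, it may be worth compiling a non-parallel version of get_nearby_count (parallel=False, plain range) so Dask's scheduler and numba's prange aren't competing for the same cores.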

PPS: 1,000,000 x 100,000 (100x smaller than your full set) took 3 min 27 s (207 seconds), so the scaling appears to be linear and a bit forgiving.

PPPS: Implemented with a simple latitude difference filter:

@jit(nopython=True, parallel=True)
def get_nearby_count_vlat(coords, coords2, max_dist):
    '''
    Input: `coords`: First set of coordinates, lat-lngs in an n x 2 array
           `coords2`: Second set of coordinates, lat-lngs in a k x 2 array
           `max_dist`: Max distance (km) to be considered nearby
    Output: Array of length n with the count of coords2 points within max_dist of each point in coords
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    coords2_abs = np.abs(coords2)
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        point = coords[i]
        # subsetting coords2 to reduce haversine calc time. Value .02 is from playing with Gmaps and will need to change for max_dist > 1.0
        coords2_filtered = coords2[np.abs(point[0] - coords2[:,0]) < .02]
        # in case of no matches
        if coords2_filtered.shape[0] == 0: continue
        # returns one distance per point remaining in coords2_filtered
        dist = haversine(point[0], point[1], coords2_filtered[:,0], coords2_filtered[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output
answered Oct 19 '22 by ernest


I've done something similar lately, but not with lat/lon, and I only had to find the nearest point and its distance. For this I used scipy.spatial.cKDTree. It was quite fast.

I think that in your case, you could use the query_ball_point() function.

from scipy import spatial
import pandas as pd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
# Build the index
tree = spatial.cKDTree(df1[['long', 'lat']])
# Then query the index

You should give it a try.
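As a rough sketch only of how the query step might look (not tested; the 6371 km Earth radius and the column names are assumptions carried over from the question): since cKDTree works in Euclidean space, one option is to convert the lat/long pairs to 3D Cartesian coordinates on a sphere so that straight-line distance approximates great-circle distance at short range, build the tree on df2, and count its points within ~1 km of each df1 point.

import numpy as np
from scipy import spatial

R = 6371.0  # approximate Earth radius in km (assumption)

def to_xyz(df):
    # lat/long in degrees -> 3D Cartesian coordinates in km; for separations of
    # ~1 km the chord distance is effectively the great-circle distance
    lat = np.deg2rad(df['lat'].to_numpy())
    lng = np.deg2rad(df['long'].to_numpy())
    return np.column_stack((R * np.cos(lat) * np.cos(lng),
                            R * np.cos(lat) * np.sin(lng),
                            R * np.sin(lat)))

tree = spatial.cKDTree(to_xyz(df2))                      # index the smaller set
neighbours = tree.query_ball_point(to_xyz(df1), r=1.0)   # lists of df2 indices within ~1 km
df1['count'] = [len(idx) for idx in neighbours]

Recent SciPy versions also accept return_length=True in query_ball_point, which returns the counts directly instead of the index lists.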

answered Oct 19 '22 by Akarius