Extract max distance for IDs that visited multiple (lat,lon)

Question

I have a table with this format :

User	lat	lon
u1	x1	y1
u1	x2	y2
u1	x3	y3
u2	x4	y4
u2	x5	y5
u2	x6	y6
u3	x7	y7
u3	x8	y8

What I'd like to do is have a table where for each user I have the distance between the farthest 2 points they've been to.

User	max_dist_km
u1	15.2
u2	23.7
u3	8.3

The naive way is to loop over users, create the distance matrix for each user and extract the max distance. This wouldn't be scalable with a huge set of users.

Is there a more efficient and elegant way to proceed ?

DarrylG · Accepted Answer

Summary

Implemented a fast algorithm which works in linear time

US Cities Dataset (30, 409 records): 0.103 seconds
Animal tracking dataset (89,867 records): 0.325 seconds
Timings on 10+ year old windows desktop (i7 920 CPU @ 2.67GHz)

Approach

Has linear complexity i.e. O(N)

N is the total number of lats/lons (i.e. counting across all user)

Perform the following steps:

Group latitude/longitude data by user
Repeat steps 3-7 for each user
Map latitudes/longitudes points to x, y, z coordinates using spherical earth approximation
Find the two furthest points as follows:
- Initialize P1 to the center of mass of the points
- Repeat the following 3 times (once is normally enough but multiple times handles corner cases):
  - Set P0 = P1
  - Set P1 = the point in points at maximum distance from P0
- P0 and P1 are the furthest two points in x, y, z
Use indexes of P0 & P1 to lookup latitude/longitude from original lat/log data
Calculate the distance between P0 & P1 using Haversine
Update results with the current user's distance
Return results for all users as a data frame

Code

import numpy as np

def lat_lon_to_xyz(lat, lon):
    '''
        Convert latitude/longitude to x, y, z in Earth centered coordinates (assuming spherical earth)
        
        lat, lon are in degrees radian
        
        Source: https://stackoverflow.com/questions/1185408/converting-from-longitude-latitude-to-cartesian-coordinates
    '''
    lat_radians = np.deg2rad(lat)
    lon_radians = np.deg2rad(lon)
    
    R = 1  # use unit sphere rather than 6371 radius of earth in km 
    x = R * np.cos(lat_radians) * np.cos(lon_radians)
    y = R * np.cos(lat_radians) * np.sin(lon_radians)
    z = R *np.sin(lat_radians)
    
    return np.array([x, y, z])
    
def furthest_points_spadsman(points):
    '''
        Based upon the following technique which scales linearly with the number of points
        
        - Initialize P1 to the center of mass of the points
        - Repeat the following 3 times (once is normally enough but multiple times handles corner cases):
          - Set P0 = P1
          - Set P1 = the point in points with maximum distance from P0
          - P0 and P1 are the furthest two points in x, y, z
        
        Technique from following reference.
        Reference: https://stackoverflow.com/q/16865291/
    '''
    # Initialize to mean
    p_1 = np.mean(points, axis = 0)
    
    for _ in range(3): # Iterating mitigates corner cases
        p_0 = p_1
        # Point in points furthest distance from p_0
        # note: can use squared distance since monotonical
        p_1 = points[np.argmax(np.sum(np.square(points - p_0), axis = -1))]
    
    return p_0, p_1

def haversine(point1, point2):
    '''
        Data in point1 and point2 are latitude/longitude pairs, 
        with first number is the latitude (north-south), 
        and the second number is the longitude (east-west)
        
        Source: https://medium.com/@petehouston/calculate-distance-of-two-locations-on-earth-using-python-1501b1944d97
    '''
    R = 6371  # Earth radius in km
    
    point1 = np.deg2rad(point1)
    point2 = np.deg2rad(point2)
    
    delta = point2 - point1
    
    a = (np.sin(delta[0] / 2) ** 2 + 
         np.cos(point1[0]) * np.cos(point2[0]) * np.sin(delta[1] / 2) ** 2)
    
    return 2 * R * np.arcsin(np.sqrt(a))
    
def process(df, user = 'user', lat_field ='lat', lon_field = 'lon'):
    '''
        Generates the Dataframe containing the maximum distance by user of a set of points
        
        The process works as following steps.
        1.  Group latitude/longitude data by user
        2.  Repeat steps 3-7 for each user
        3.  Map latitudes/longitudes points to x, y, z coordinates using spherical earth approximation)
        4.  Find two furthest points as follows:
            i. calculate the center of mass M of the points
            ii. find the point P0 that has the maximum distance to M
            iii. find the point P1 that has the maximum distance to P0
            iv. P0 and P1 are the furthest two points in x, y, z
        5. Use indexes of P0 & P1 to lookup latitude/longitude from original lat/log data
        6. Calcualte distance between P0 & P1 using Haversine
        7. Update results
        8. Return results as a dataframe
        
         Process based upon following references:
         a. https://stackoverflow.com/questions/16865291/greatest-distance-between-set-of-longitude-latitude-points/16870359#16870359
         b. https://medium.com/@petehouston/calculate-distance-of-two-locations-on-earth-using-python-1501b1944d97
    
    '''
    results = []                              # holds list of tuples of (user, distance)
    for user_, g in df.groupby(user):            # Step 1--Group latitude/longitude data by user
        # Step 2--Repeat steps 2-4 for each user
        points_lat_lon = g[[lat_field, lon_field]].to_numpy()

        # Step 3--map latitudes/longitudes points to x, y, z coordinates
        points_xyz = lat_lon_to_xyz(points_lat_lon[:, 0], points_lat_lon[:, 1]).transpose()

        # Step 4--Find two furthest points
        # Find two furthest points in xyz (using spherical earth aproximation)
        p_0, p_1 = furthest_points_spadsman(points_xyz)
        
        # Step 5--Use indexes of P0 & P1 to lookup latitude/longitude from original lat/log data
        # Index of p_0 and p_1 in points_xyz (so we also corresponds to the index in points_lat_lon)
        index_0 = np.where(np.prod(points_xyz == p_0, axis = -1))[0][0]
        index_1 = np.where(np.prod(points_xyz == p_1, axis = -1))[0][0]

        lat_lon_0 = points_lat_lon[index_0, :]
        lat_lon_1 = points_lat_lon[index_1, :]
     
        # Step 6--Calcualte distance between P0 & P1 using Haversine
        distance = haversine(lat_lon_0, lat_lon_1)
        
        # Step 7--update results
        results.append((user_, distance))
    
    # Step 8--Return results as a dataframe
    return pd.DataFrame(results, columns = [user, 'Max_Distance_km'])

Tests

Test 1

Description

Computed maximum distance between cities in the United States

Used state id as user
Total 30, 409 records (multiple records per city and state)
Each record included state id, lat, long
Processing time for 30, 409 records: 0.104 seconds on 10+ year old windows desktop (i7 920 CPU @ 2.67GHz)

Dataset

Downloaded from this site: simplemaps
Contains many cities per state
Used state id as user (i.e. found max distances between cities by state)

Test Code

from time import time
import pandas as pd

# CSV file downloadable from https://simplemaps.com/data/us-cities
# Datafile with 30, 409 records
cities = pd.read_csv('simplemaps_uscities_basicv1.75/uscities.csv')

t0 = time()
result = process(cities, user = 'state_id', lat_field = 'lat', lon_field = 'lng')
print(f'Processing time: {time()-t0:.3f} seconds')
print(f'Results:
{result}')

Output

Processing time: 0.104 seconds
Results:
   state_id  Max_Distance_km
0        AK      3586.855864
1        AL       569.292071
2        AR       492.544129
3        AZ       712.434590
4        CA      1321.284443
5        CO       697.572158
6        CT       182.286421
7        DC         0.000000
8        DE       156.778146
9        FL       936.595405
10       GA       589.700716
11       HI       574.129490
12       IA       538.297210
13       ID       825.044994
14       IL       622.014829
15       IN       496.787181
16       KS       682.563079
17       KY       633.576282
18       LA       601.891459
19       MA       301.815349
20       MD       397.753918
21       ME       509.556000
22       MI       743.578849
23       MN       751.324104
24       MO       707.260076
25       MS       534.872877
26       MT       961.640222
27       NC       778.308918
28       ND       582.080515
29       NE       763.370612
30       NH       249.275265
31       NJ       259.273945
32       NM       747.581138
33       NV       807.834661
34       NY       641.785757
35       OH       471.708115
36       OK       826.431505
37       OR       649.340103
38       PA       508.693319
39       PR       205.710138
40       RI        81.539958
41       SC       435.894534
42       SD       688.135798
43       TN       751.286457
44       TX      1240.972424
45       UT       611.262766
46       VA       729.361836
47       VT       285.877877
48       WA       616.073484
49       WI       570.813035
50       WV       441.834382
51       WY       682.873519

Test 2

Description

Find furthest distances traveled by animals in animal tracking data.

126 different animal tags (e.g. users)
89, 867 data records
Processed in 0.325 seconds

Dataset

Movebank is an online database of animal tracking data hosted by the Max Planck Institute of Animal Behavior.
Used Movebank dataset from Kaggle.
Data Source

Test Code

from time import time
import pandas as pd

# Data downloaded from above kaggle link
df = pd.read_csv('migration_original.csv/migration_original.csv')

t0 = time()
result = process(df, user = 'individual-local-identifier', lat_field = 'location-lat', lon_field = 'location-long')
print(f'Processing time: {time()-t0:.3f} seconds')
print(f'Results:
{result}')

Output

Processing time: 0.325 seconds
Results:
    individual-local-identifier  Max_Distance_km
0                        91732A      7073.629785
1                        91733A        65.788571
2                        91734A      3446.277830
3                        91735A       231.789762
4                        91737A      5484.820693
..                          ...              ...
121                      91920A      2535.920902
122                      91921A        26.698255
123                      91924A        14.518173
124                      91929A         0.806871
125                      91930A        10.427890

[126 rows x 2 columns]

References

Greatest distance between set of longitude/latitude points
Calculate distance of two locations on Earth using Python

Acknowledgements

Thanks to @MangoNrFiv whose comments helped improve the implementation and testing.

Extract max distance for IDs that visited multiple (lat,lon)

Tags:

python

pandas

geolocation

distance

geopandas

mlx

1 Answers

Summary

Approach

Code

Tests

Test 1

Description

Dataset

Test Code

Output

Test 2

Description

Dataset

Test Code

Output

References

Acknowledgements

DarrylG

Recent Activity

Donate For Us

Extract max distance for IDs that visited multiple (lat,lon)

Tags:

python

pandas

geolocation

distance

geopandas

mlx

1 Answers

Summary

Approach

Code

Tests

Test 1

Description

Dataset

Test Code

Output

Test 2

Description

Dataset

Test Code

Output

References

Acknowledgements

DarrylG

Related questions

Recent Activity

Donate For Us