I have a table with this format :
| User | lat | lon |
|---|---|---|
| u1 | x1 | y1 |
| u1 | x2 | y2 |
| u1 | x3 | y3 |
| u2 | x4 | y4 |
| u2 | x5 | y5 |
| u2 | x6 | y6 |
| u3 | x7 | y7 |
| u3 | x8 | y8 |
What I'd like to do is have a table where for each user I have the distance between the farthest 2 points they've been to.
| User | max_dist_km |
|---|---|
| u1 | 15.2 |
| u2 | 23.7 |
| u3 | 8.3 |
The naive way is to loop over users, create the distance matrix for each user and extract the max distance. This wouldn't be scalable with a huge set of users.
Is there a more efficient and elegant way to proceed ?
Implemented a fast algorithm which works in linear time
Has linear complexity i.e. O(N)
Perform the following steps:
import numpy as np
def lat_lon_to_xyz(lat, lon):
'''
Convert latitude/longitude to x, y, z in Earth centered coordinates (assuming spherical earth)
lat, lon are in degrees radian
Source: https://stackoverflow.com/questions/1185408/converting-from-longitude-latitude-to-cartesian-coordinates
'''
lat_radians = np.deg2rad(lat)
lon_radians = np.deg2rad(lon)
R = 1 # use unit sphere rather than 6371 radius of earth in km
x = R * np.cos(lat_radians) * np.cos(lon_radians)
y = R * np.cos(lat_radians) * np.sin(lon_radians)
z = R *np.sin(lat_radians)
return np.array([x, y, z])
def furthest_points_spadsman(points):
'''
Based upon the following technique which scales linearly with the number of points
- Initialize P1 to the center of mass of the points
- Repeat the following 3 times (once is normally enough but multiple times handles corner cases):
- Set P0 = P1
- Set P1 = the point in points with maximum distance from P0
- P0 and P1 are the furthest two points in x, y, z
Technique from following reference.
Reference: https://stackoverflow.com/q/16865291/
'''
# Initialize to mean
p_1 = np.mean(points, axis = 0)
for _ in range(3): # Iterating mitigates corner cases
p_0 = p_1
# Point in points furthest distance from p_0
# note: can use squared distance since monotonical
p_1 = points[np.argmax(np.sum(np.square(points - p_0), axis = -1))]
return p_0, p_1
def haversine(point1, point2):
'''
Data in point1 and point2 are latitude/longitude pairs,
with first number is the latitude (north-south),
and the second number is the longitude (east-west)
Source: https://medium.com/@petehouston/calculate-distance-of-two-locations-on-earth-using-python-1501b1944d97
'''
R = 6371 # Earth radius in km
point1 = np.deg2rad(point1)
point2 = np.deg2rad(point2)
delta = point2 - point1
a = (np.sin(delta[0] / 2) ** 2 +
np.cos(point1[0]) * np.cos(point2[0]) * np.sin(delta[1] / 2) ** 2)
return 2 * R * np.arcsin(np.sqrt(a))
def process(df, user = 'user', lat_field ='lat', lon_field = 'lon'):
'''
Generates the Dataframe containing the maximum distance by user of a set of points
The process works as following steps.
1. Group latitude/longitude data by user
2. Repeat steps 3-7 for each user
3. Map latitudes/longitudes points to x, y, z coordinates using spherical earth approximation)
4. Find two furthest points as follows:
i. calculate the center of mass M of the points
ii. find the point P0 that has the maximum distance to M
iii. find the point P1 that has the maximum distance to P0
iv. P0 and P1 are the furthest two points in x, y, z
5. Use indexes of P0 & P1 to lookup latitude/longitude from original lat/log data
6. Calcualte distance between P0 & P1 using Haversine
7. Update results
8. Return results as a dataframe
Process based upon following references:
a. https://stackoverflow.com/questions/16865291/greatest-distance-between-set-of-longitude-latitude-points/16870359#16870359
b. https://medium.com/@petehouston/calculate-distance-of-two-locations-on-earth-using-python-1501b1944d97
'''
results = [] # holds list of tuples of (user, distance)
for user_, g in df.groupby(user): # Step 1--Group latitude/longitude data by user
# Step 2--Repeat steps 2-4 for each user
points_lat_lon = g[[lat_field, lon_field]].to_numpy()
# Step 3--map latitudes/longitudes points to x, y, z coordinates
points_xyz = lat_lon_to_xyz(points_lat_lon[:, 0], points_lat_lon[:, 1]).transpose()
# Step 4--Find two furthest points
# Find two furthest points in xyz (using spherical earth aproximation)
p_0, p_1 = furthest_points_spadsman(points_xyz)
# Step 5--Use indexes of P0 & P1 to lookup latitude/longitude from original lat/log data
# Index of p_0 and p_1 in points_xyz (so we also corresponds to the index in points_lat_lon)
index_0 = np.where(np.prod(points_xyz == p_0, axis = -1))[0][0]
index_1 = np.where(np.prod(points_xyz == p_1, axis = -1))[0][0]
lat_lon_0 = points_lat_lon[index_0, :]
lat_lon_1 = points_lat_lon[index_1, :]
# Step 6--Calcualte distance between P0 & P1 using Haversine
distance = haversine(lat_lon_0, lat_lon_1)
# Step 7--update results
results.append((user_, distance))
# Step 8--Return results as a dataframe
return pd.DataFrame(results, columns = [user, 'Max_Distance_km'])
Computed maximum distance between cities in the United States
from time import time
import pandas as pd
# CSV file downloadable from https://simplemaps.com/data/us-cities
# Datafile with 30, 409 records
cities = pd.read_csv('simplemaps_uscities_basicv1.75/uscities.csv')
t0 = time()
result = process(cities, user = 'state_id', lat_field = 'lat', lon_field = 'lng')
print(f'Processing time: {time()-t0:.3f} seconds')
print(f'Results:\n{result}')
Processing time: 0.104 seconds
Results:
state_id Max_Distance_km
0 AK 3586.855864
1 AL 569.292071
2 AR 492.544129
3 AZ 712.434590
4 CA 1321.284443
5 CO 697.572158
6 CT 182.286421
7 DC 0.000000
8 DE 156.778146
9 FL 936.595405
10 GA 589.700716
11 HI 574.129490
12 IA 538.297210
13 ID 825.044994
14 IL 622.014829
15 IN 496.787181
16 KS 682.563079
17 KY 633.576282
18 LA 601.891459
19 MA 301.815349
20 MD 397.753918
21 ME 509.556000
22 MI 743.578849
23 MN 751.324104
24 MO 707.260076
25 MS 534.872877
26 MT 961.640222
27 NC 778.308918
28 ND 582.080515
29 NE 763.370612
30 NH 249.275265
31 NJ 259.273945
32 NM 747.581138
33 NV 807.834661
34 NY 641.785757
35 OH 471.708115
36 OK 826.431505
37 OR 649.340103
38 PA 508.693319
39 PR 205.710138
40 RI 81.539958
41 SC 435.894534
42 SD 688.135798
43 TN 751.286457
44 TX 1240.972424
45 UT 611.262766
46 VA 729.361836
47 VT 285.877877
48 WA 616.073484
49 WI 570.813035
50 WV 441.834382
51 WY 682.873519
Find furthest distances traveled by animals in animal tracking data.
from time import time
import pandas as pd
# Data downloaded from above kaggle link
df = pd.read_csv('migration_original.csv/migration_original.csv')
t0 = time()
result = process(df, user = 'individual-local-identifier', lat_field = 'location-lat', lon_field = 'location-long')
print(f'Processing time: {time()-t0:.3f} seconds')
print(f'Results:\n{result}')
Processing time: 0.325 seconds
Results:
individual-local-identifier Max_Distance_km
0 91732A 7073.629785
1 91733A 65.788571
2 91734A 3446.277830
3 91735A 231.789762
4 91737A 5484.820693
.. ... ...
121 91920A 2535.920902
122 91921A 26.698255
123 91924A 14.518173
124 91929A 0.806871
125 91930A 10.427890
[126 rows x 2 columns]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With