I have a csv containing locations (latitude, longitude) for a given user, denoted by the ID field, at a given time (timestamp). I need to calculate the distance and the velocity between each point and the successive point for each user. For example, for ID 1 I need to find the distance and velocity between point 1 and point 2, point 2 and point 3, point 3 and point 4, and so on. Since I am working with coordinates on the Earth, I understand the Haversine formula should be used for the distance calculation; however, I am unsure how to iterate through my file given the time and user ordering in my problem. With python, how can I iterate through my file to sort the events by user and by time, and then calculate the distance and velocity between each pair of successive points?
Ideally, the output would be a second csv looking something like: ID#, start_time, start_location, end_time, end_location, distance, velocity.
Sample data below:
ID,timestamp,latitude,longitude
3,6/9/2017 22:20,38.7953326,77.0088833
1,5/5/2017 13:10,38.8890106,77.0500613
2,2/10/2017 16:23,40.7482494,73.9841913
1,5/5/2017 12:35,38.9206015,77.2223287
3,6/10/2017 10:00,42.3662109,71.0209426
1,5/5/2017 20:00,38.8974155,77.0368333
2,2/10/2017 7:30,38.8514261,77.0422981
3,6/9/2017 10:20,38.9173461,77.2225527
2,2/10/2017 19:51,40.7828687,73.9675438
3,6/10/2017 6:42,38.9542676,77.4496951
1,5/5/2017 16:35,38.8728748,77.0077629
2,2/10/2017 10:00,40.7769311,73.8761546
Seems like you could use the magic of pandas.
It's easy to create a pandas dataframe from a csv file using the read_csv() function:
import pandas as pd
df = pd.read_csv(filename)
Based on your sample data, this will create the following dataframe:
ID timestamp latitude longitude
0 3 6/9/2017 22:20 38.795333 77.008883
1 1 5/5/2017 13:10 38.889011 77.050061
2 2 2/10/2017 16:23 40.748249 73.984191
3 1 5/5/2017 12:35 38.920602 77.222329
4 3 6/10/2017 10:00 42.366211 71.020943
5 1 5/5/2017 20:00 38.897416 77.036833
6 2 2/10/2017 7:30 38.851426 77.042298
7 3 6/9/2017 10:20 38.917346 77.222553
8 2 2/10/2017 19:51 40.782869 73.967544
9 3 6/10/2017 6:42 38.954268 77.449695
10 1 5/5/2017 16:35 38.872875 77.007763
11 2 2/10/2017 10:00 40.776931 73.876155
Pandas (and python in general) has extensive libraries for date and time operations. But first, you will need to prepare your data by converting the timestamp column (a string) into a datetime object. I am assuming your timestamps are in the format "MM/DD/YYYY HH:MM" (since you didn't specify).
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %H:%M')
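As a side note, the same conversion can be done while loading the file, if you prefer; parse_dates lets read_csv convert the column for you, although it infers the format on its own, so double-check that the month/day order comes out as expected:
# Optional alternative: parse the timestamp column while reading the csv.
df = pd.read_csv(filename, parse_dates=['timestamp'])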
You're going to have to define some functions to compute the distance and the velocity. The Haversine distance function is adapted from this answer.
from math import sin, cos, sqrt, atan2, radians

def getDistanceFromLatLonInKm(lat1, lon1, lat2, lon2):
    R = 6371  # Radius of the earth in km
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    rLat1 = radians(lat1)
    rLat2 = radians(lat2)
    a = sin(dLat/2) * sin(dLat/2) + cos(rLat1) * cos(rLat2) * sin(dLon/2) * sin(dLon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = R * c  # Distance in km
    return d
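As a quick sanity check, feeding it the first two points for ID 1 from your sample data (the 12:35 and 13:10 rows) should give roughly 15.31 km, which matches the dist_km value for that row in the output further down:
# Distance between ID 1's first two points (the 12:35 and 13:10 rows)
print(getDistanceFromLatLonInKm(38.9206015, 77.2223287, 38.8890106, 77.0500613))
# -> approximately 15.31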
def calc_velocity(dist_km, time_start, time_end):
    """Return velocity in km/s; return 0 if time_start == time_end, to avoid dividing by 0."""
    # total_seconds() (rather than .seconds) keeps this correct for gaps longer than a day.
    return dist_km / (time_end - time_start).total_seconds() if time_end > time_start else 0
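And again as a rough check against the sample data: ID 1 covers the ~15.31 km above in 35 minutes (12:35 to 13:10), i.e. about 0.0073 km/s, which lines up with the velocity_kmps column in the output below:
import pandas as pd
t0 = pd.Timestamp('2017-05-05 12:35')
t1 = pd.Timestamp('2017-05-05 13:10')
print(calc_velocity(15.31, t0, t1))  # -> approximately 0.0073 km/s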
We want to compute the Haversine function on each row, but we need some information from the first row of each group. Luckily, pandas makes this easy with sort_values(), groupby() and transform().
The following code makes 3 new columns, one each for the initial latitude, longitude, and time for each ID.
# First sort by ID and timestamp:
df = df.sort_values(by=['ID', 'timestamp'])
# Group the sorted dataframe by ID, and grab the initial value for lat, lon, and time.
df['lat0'] = df.groupby('ID')['latitude'].transform(lambda x: x.iat[0])
df['lon0'] = df.groupby('ID')['longitude'].transform(lambda x: x.iat[0])
df['t0'] = df.groupby('ID')['timestamp'].transform(lambda x: x.iat[0])
# create a new column for distance
df['dist_km'] = df.apply(
    lambda row: getDistanceFromLatLonInKm(
        lat1=row['latitude'],
        lon1=row['longitude'],
        lat2=row['lat0'],
        lon2=row['lon0']
    ),
    axis=1
)

# create a new column for velocity
df['velocity_kmps'] = df.apply(
    lambda row: calc_velocity(
        dist_km=row['dist_km'],
        time_start=row['t0'],
        time_end=row['timestamp']
    ),
    axis=1
)
>>> print(df[['ID', 'timestamp', 'latitude', 'longitude', 'dist_km', 'velocity_kmps']])
ID timestamp latitude longitude dist_km velocity_kmps
3 1 2017-05-05 12:35:00 38.920602 77.222329 0.000000 0.000000
1 1 2017-05-05 13:10:00 38.889011 77.050061 15.314742 0.007293
10 1 2017-05-05 16:35:00 38.872875 77.007763 19.312148 0.001341
5 1 2017-05-05 20:00:00 38.897416 77.036833 16.255868 0.000609
6 2 2017-02-10 07:30:00 38.851426 77.042298 0.000000 0.000000
11 2 2017-02-10 10:00:00 40.776931 73.876155 344.880549 0.038320
2 2 2017-02-10 16:23:00 40.748249 73.984191 335.727502 0.010498
8 2 2017-02-10 19:51:00 40.782869 73.967544 339.206320 0.007629
7 3 2017-06-09 10:20:00 38.917346 77.222553 0.000000 0.000000
0 3 2017-06-09 22:20:00 38.795333 77.008883 22.942974 0.000531
9 3 2017-06-10 06:42:00 38.954268 77.449695 20.070609 0.000274
4 3 2017-06-10 10:00:00 42.366211 71.020943 648.450485 0.007611
From here, I will leave it to you to figure out how to grab the last entry for each ID.
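In case it helps, here is one possible sketch of that last step (not necessarily how you'd want your final csv laid out). Because every dist_km and velocity_kmps is measured from the group's first point, the last row of each sorted group already holds the start and end information for that ID; the output column names below are just my guess at the layout you described:
# Grab the last row of each ID group (df is already sorted by ID and timestamp).
summary = df.groupby('ID').tail(1)

# Rename columns to roughly match the requested output layout.
summary = summary.rename(columns={
    't0': 'start_time',
    'lat0': 'start_latitude',
    'lon0': 'start_longitude',
    'timestamp': 'end_time',
    'latitude': 'end_latitude',
    'longitude': 'end_longitude',
    'dist_km': 'distance',
    'velocity_kmps': 'velocity',
})

cols = ['ID', 'start_time', 'start_latitude', 'start_longitude',
        'end_time', 'end_latitude', 'end_longitude', 'distance', 'velocity']
summary[cols].to_csv('output.csv', index=False)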