For a dataset of 200M GPS (lon, lat) coordinates of vessels I want to calculate an approximate distance to the nearest land or coastline, as a function called distance_to_shore, that will return the distance and country of that shore.
I'm using a shape file of country boundaries and coastlines from: http://www.naturalearthdata.com/
Some considerations are that the Oceanic pole of inaccessibility is 2688 km. So this would be the maximum possible distance from shore, this could be used to create some kind of bounding box. I want to calculate accounting for the Earth's curvature (not Euclidean), e.g. Haversine, or Vincenty method.
For this I started looking at scipy.spatial.cKDTree, but this does not allow for Haversine distance metric. On the other hand the sklearn.neighbors.BallTree, does allows for Haversine distance metric but I can't get it to work. Here is the code I have so far. N.B. the function should ideally be vectorized.
############################### SOLUTION ###############################
Thanks for all the input this is how I solved it in Python, including functions to download relevant shape files, needs some cleaning
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import shapely as sp
import cartopy.io.shapereader as shpreader
import ssl
import urllib.request
import zipfile
from shutil import rmtree
from dbfread import DBF
from scipy import spatial
from sklearn.neighbors import NearestNeighbors, BallTree
from pyproj import Proj, transform
from math import *
coastline = np.load(os.path.join(os.path.dirname(__file__),
'../data/shape_files/coast_coords_10m.npy'))
ports = np.load(os.path.join(os.path.dirname(__file__),
'../data/shape_files/ports_coords.npy'))
def extract_geom_meta(country):
'''
extract from each geometry the name of the country
and the geom_point data. The output will be a list
of tuples and the country name as the last element.
'''
geoms = country.geometry
coords = np.empty(shape=[0, 2])
for geom in geoms:
coords = np.append(coords, geom.exterior.coords, axis = 0)
country_name = country.attributes["ADMIN"]
return [coords, country_name]
def save_coastline_shape_file():
'''
store shp files locally, this functions will download
shapefiles for the whole planet.
'''
ne_earth = shpreader.natural_earth(resolution = '10m',
category = 'cultural',
name='admin_0_countries')
reader = shpreader.Reader(ne_earth)
countries = reader.records()
# extract and create separate objects
world_geoms = [extract_geom_meta(country) for country in countries]
coords_countries = np.vstack([[np.array(x[:-1]), x[-1]]
for x in world_geoms])
coastline = np.save(os.path.join(os.path.dirname(__file__),
'../data/shape_files/coast_coords_10m.npy')
, coords_countries)
print('Saving coordinates (...)')
def distance_to_shore(lon, lat):
'''
This function will create a numpy array of distances
to shore. It will contain and ID for AIS points and
the distance to the nearest coastline point.
'''
coastline_coords = np.vstack([np.flip(x[0][0], axis=1) for x in coastline])
countries = np.hstack([np.repeat(str(x[1]), len(x[0][0])) for x in coastline])
tree = BallTree(np.radians(coastline_coords), metric='haversine')
coords = pd.concat([np.radians(lat), np.radians(lon)], axis=1)
dist, ind = tree.query(coords, k=1)
df_distance_to_shore = pd.Series(dist.flatten()*6371, name='distance_to_shore')
df_countries = pd.Series(countries[ind].flatten(), name='shore_country')
return pd.concat([df_distance_to_shore, df_countries], axis=1)
Judging distance at sea: Height of eye With a height of eye of three metres, the horizon is 3.6 miles away (at two metres it is 2.9 miles). This means that if you can see the ship's bow wave and waterline you are within this distance.
So, for all practical purposes, we can estimate the distance to the horizon using 1.22459√h (or perhaps even more simply as just 1.2√h), which gives the distance in miles when h is in feet. The metric version is s ≈ 3.56972 √h ≈ 3.6 √h in kilometers, with h in meters.
The efficient way of solving this problem is to store all your coast points into a vantage point tree using the geodesic distance as your metric (it's important that the metric satisfy the triangle inequality). Then for each vessel you can query the VP tree to find the closed point.
If there are M coast points and N vessels. Then the time to construct the VP tree requires M log M distance calculations. Each query requires log M distance calculations. A distance calculation for the ellipsoid takes about 2.5 μs. So the total time is (M + N) log M × 2.5 μs.
Here is code using my library GeographicLib (version 1.47 or later) to carry out this calculation. This is just a stripped-down version of the example given for the NearestNeighbor class.
// Example of using the GeographicLib::NearestNeighbor class. Read lon/lat
// points for coast from coast.txt and lon/lat for vessels from vessels.txt.
// For each vessel, print to standard output: the index for the closest point
// on coast and the distance to it.
// This requires GeographicLib version 1.47 or later.
// Compile/link with, e.g.,
// g++ -I/usr/local/include -lGeographic -L/usr/local/bin -Wl,-rpath=/usr/local/lib -o coast coast.cpp
// Run time for 30000 coast points and 46217 vessels is 3 secs.
#include <iostream>
#include <exception>
#include <vector>
#include <fstream>
#include <GeographicLib/NearestNeighbor.hpp>
#include <GeographicLib/Geodesic.hpp>
using namespace std;
using namespace GeographicLib;
// A structure to hold a geographic coordinate.
struct pos {
double _lat, _lon;
pos(double lat = 0, double lon = 0) : _lat(lat), _lon(lon) {}
};
// A class to compute the distance between 2 positions.
class DistanceCalculator {
private:
Geodesic _geod;
public:
explicit DistanceCalculator(const Geodesic& geod) : _geod(geod) {}
double operator() (const pos& a, const pos& b) const {
double d;
_geod.Inverse(a._lat, a._lon, b._lat, b._lon, d);
if ( !(d >= 0) )
// Catch illegal positions which result in d = NaN
throw GeographicErr("distance doesn't satisfy d >= 0");
return d;
}
};
int main() {
try {
// Read in coast
vector<pos> coast;
double lat, lon;
{
ifstream is("coast.txt");
if (!is.good())
throw GeographicErr("coast.txt not readable");
while (is >> lon >> lat)
coast.push_back(pos(lat, lon));
if (coast.size() == 0)
throw GeographicErr("need at least one location");
}
// Define a distance function object
DistanceCalculator distance(Geodesic::WGS84());
// Create NearestNeighbor object
NearestNeighbor<double, pos, DistanceCalculator>
coastset(coast, distance);
ifstream is("vessels.txt");
double d;
int count = 0;
vector<int> k;
while (is >> lon >> lat) {
++count;
d = coastset.Search(coast, distance, pos(lat, lon), k);
if (k.size() != 1)
throw GeographicErr("unexpected number of results");
cout << k[0] << " " << d << "\n";
}
}
catch (const exception& e) {
cerr << "Caught exception: " << e.what() << "\n";
return 1;
}
}
This example is in C++. To use python, you'll need to find a python implementation of VP trees and then you can use the python version of GeographicLib for the distance calculations.
P.S. GeographicLib uses an accurate algorithm for the geodesic distance that satisfies the triangle inequality. The Vincenty method fails to converge for nearly antipodal points and so does not satisfy the triangle inequality.
ADDENDUM: here's the python implementation: Install vptree and geographiclib
pip install vptree geographiclib
coast points (lon,lat) are in coast.txt; vessel positions (lon,lat) are in vessels.txt. Run
import numpy
import vptree
from geographiclib.geodesic import Geodesic
def geoddist(p1, p2):
# p1 = [lon1, lat1] in degrees
# p2 = [lon2, lat2] in degrees
return Geodesic.WGS84.Inverse(p1[1], p1[0], p2[1], p2[0])['s12']
coast = vptree.VPTree(numpy.loadtxt('coast.txt'), geoddist)
print('vessel closest-coast dist')
for v in numpy.loadtxt('vessels.txt'):
c = coast.get_nearest_neighbor(v)
print(list(v), list(c[1]), c[0])
For 30000 coast points and 46217 vessels, this takes 18 min 3 secs.
This is longer than I expected. The time to construct the tree is
1 min 16 secs. So the total time should be about 3 min.
For 30000 coast points and 46217 vessels, this takes 4 min (using version 1.1.1 of vptree). For comparison, the time using the GeographicLib C++ library is 3 secs.
LATER: I looked into why the python vptree is slow. The number of
distance calculations to set up the tree is the same for GeographicLib's
C++ implementation and python vptree package: 387248 which is about M
log M, for M = 30000. (Here logs are base 2 and I set the bucket
size to 1 for both implementations to ease comparisons.) The mean
number of distance calculations for each vessel lookup for the C++
implementation is 14.7 which is close to the expected value, log M =
14.9. However the equivalent statistic for the python implementation is
108.9, a factor for 7.4 larger.
Various factors influence the efficiency of the VP tree: the choice of vantage points, how the search is ordered, etc. A discussion of these considerations for the GeographicLib implementation is given here. I will ping the author of the python package about this.
STILL LATER: I've submitted a pull request which cures the major problems with the efficiency of the python package vptree. The CPU time for my test is now about 4 min. The number of distance calculations per query is 16.7 (close to the figure for GeographicLib::NearestNeighbor, 14.7).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With