
ocean latitude longitude point distance from shore

I started a "free" open-sourced project to create a new data set for pH of the earth oceans.

I started from the open NOAA data set and created a 2.45-million-row data set with these columns:

colnames(NOAA_NODC_OSD_SUR_pH_7to9)
[1] "Year"  "Month" "Day"   "Hour"  "Lat"   "Long"  "Depth" "pH"   

Method document HERE.

Data-set HERE.

My goal now is to "qualify" each of the 2.45 million rows. To do so, I need to calculate the distance from each Lat/Long point to the nearest shore.

So I am looking for a method that takes Lat/Long as input and returns the distance from shore (in km) as output.

With this, I can assess whether a data point could be affected by shore contamination, such as effluent from a nearby city.

I have searched for a method to do this, but everything I found seems to require packages/software that I don't have.

If someone is willing to help out, I would appreciate it. Or if you know of an easy (free) method to accomplish this, please let me know.

I can work in R and with shell scripts, but I'm not an expert in either.

asked Dec 29 '14 by Simon Filiatrault


1 Answer

So there are several things going on here. First, your dataset seems to have pH vs. depth, so while there are ~2.5 million rows, only ~200,000 of them have Depth == 0 (still a lot).
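(As a quick check of that count, and purely as a sketch that assumes the CSV from the link above is named OSD_All.csv as in the code further down:)

# count the surface-only rows (Depth == 0) in the OP's data set
df <- read.csv("OSD_All.csv")
sum(df$Depth == 0)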

Second, to get distance to nearest coast you need a shapefile of coastlines. Fortunately this is available here, at the excellent Natural Earth website.

Third, your data is in long/lat (so, units = degrees), but you want distance in km, so you need to transform your data (the coastline data above is also in long/lat and also needs to be transformed). One problem with transformations is that your data is evidently global, and any global transformation will necessarily be non-planar, so the accuracy will depend on the actual location. The right way to do this is to grid your data and then use a planar transformation appropriate to whichever grid cell each point falls in. This is beyond the scope of this question, though, so we'll use a global transformation (Mollweide) just to give you an idea of how it's done in R.

library(rgdal)   # for readOGR(...); loads package sp as well
library(rgeos)   # for gDistance(...)

setwd(" < directory with all your files > ")
# WGS84 long/lat
wgs.84    <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
# ESRI:54009 world mollweide projection, units = meters
# see http://www.spatialreference.org/ref/esri/54009/
mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
df        <- read.csv("OSD_All.csv")
sp.points <- SpatialPoints(df[df$Depth==0,c("Long","Lat")], proj4string=CRS(wgs.84))

coast  <- readOGR(dsn=".",layer="ne_10m_coastline",p4s=wgs.84)
coast.moll <- spTransform(coast,CRS(mollweide))
point.moll <- spTransform(sp.points,CRS(mollweide))

set.seed(1)   # for reproducible example
test   <- sample(1:length(sp.points),10)  # random sample of ten points
result <- sapply(test,function(i)gDistance(point.moll[i],coast.moll))
result/1000   # distance in km
#  [1]   0.2185196   5.7132447   0.5302977  28.3381043 243.5410571 169.8712255   0.4182755  57.1516195 266.0498881 360.6789699

plot(coast)
points(sp.points[test],pch=20,col="red")

So this reads your dataset, extracts rows where Depth==0, and converts that to a SpatialPoints object. Then we read the coastlines database downloaded from the link above into a SpatialLines object. Then we transform both to the Mollweide projection using spTransform(...), then we use gDistance(...) in the rgeos package to calculate the minimum distance between each point and the nearest coast.
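One small sanity check worth doing before computing distances (just a suggestion, not part of the original workflow): confirm that both layers really ended up in the same projected CRS, since gDistance(...) assumes planar coordinates and will not reproject for you.

# both layers should report the same (Mollweide) CRS string
proj4string(coast.moll)
proj4string(point.moll)
identical(proj4string(coast.moll), proj4string(point.moll))   # should be TRUE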

Again, it is important to remember that despite all the decimal places, these distances are just approximate.
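If you do need better accuracy, here is a rough sketch of the "grid your data" idea from the third point above (illustration only, not run against the full dataset): assign each surface point a UTM zone from its longitude and transform the points, zone by zone, with that zone's own planar CRS. The coastline would need the same per-zone treatment before calling gDistance(...).

# Sketch only: bucket surface points into UTM zones and transform one zone
# with its own planar CRS; repeat per zone (and do the same to the coastline).
surface  <- df[df$Depth == 0, ]
utm.zone <- floor((surface$Long + 180)/6) + 1    # standard UTM zone number, 1-60

# build a proj4 string for a given zone/hemisphere
utm.crs <- function(zone, south=FALSE) {
  paste0("+proj=utm +zone=", zone, if (south) " +south" else "",
         " +datum=WGS84 +units=m +no_defs")
}

# example: the northern-hemisphere points falling in (arbitrarily) zone 18
in.18      <- which(utm.zone == 18 & surface$Lat >= 0)
pts.18     <- SpatialPoints(surface[in.18, c("Long","Lat")], proj4string=CRS(wgs.84))
pts.18.utm <- spTransform(pts.18, CRS(utm.crs(18)))

Whether the extra bookkeeping is worth it depends on how accurate the distances need to be for your qualification threshold.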

One very big problem is speed: this process takes ~ 2 min for 1000 distances (on my system), so to run all 200,000 distances would take about 6.7 hours. One option, theoretically, would be to find a coastline database with a lower resolution.
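For example, Natural Earth also publishes a much coarser 1:110m coastline; swapping it in is essentially a one-line change (sketch, assuming you have downloaded ne_110m_coastline into the same directory):

# use the coarser 1:110m Natural Earth coastline to trade accuracy for speed
coast.110      <- readOGR(dsn=".", layer="ne_110m_coastline", p4s=wgs.84)
coast.110.moll <- spTransform(coast.110, CRS(mollweide))
result.110     <- sapply(test, function(i) gDistance(point.moll[i], coast.110.moll))
result.110/1000   # km; compare with the 10m results above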

The code below will calculate all 201,000 distances.

## not run
## estimated run time ~ 7 hours
result <- sapply(1:length(point.moll), function(i) gDistance(point.moll[i], coast.moll))   # meters
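Once the full run finishes, attaching the distances to the surface rows and flagging which ones might be shore-affected is straightforward. A sketch (not run; the 10 km cutoff and the output file name are just placeholders, use whatever threshold fits your contamination criterion):

## not run -- sketch only
surface <- df[df$Depth == 0, ]
surface$Dist.km   <- result/1000           # full-run distances above are in meters
surface$Nearshore <- surface$Dist.km < 10  # placeholder threshold: within 10 km of a coast
write.csv(surface, "OSD_Surface_with_dist.csv", row.names=FALSE)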

EDIT: The OP's comment about cores got me thinking that this could be a case where the improvement from parallelization is worth the effort. So here is how you would run this (on Windows) using parallel processing.

library(foreach)   # for foreach(...)
library(snow)      # for makeCluster(...)
library(doSNOW)    # for registerDoSNOW(...)

cl <- makeCluster(4,type="SOCK")  # create a 4-processor cluster
registerDoSNOW(cl)                # register the cluster

get.dist.parallel <- function(n) {
  foreach(i=1:n, .combine=c, .packages="rgeos", .inorder=TRUE, 
          .export=c("point.moll","coast.moll")) %dopar% gDistance(point.moll[i],coast.moll)
}
get.dist.seq <- function(n) sapply(1:n,function(i)gDistance(point.moll[i],coast.moll))

identical(get.dist.seq(10),get.dist.parallel(10))  # same result?
# [1] TRUE
library(microbenchmark)  # run "benchmark"
microbenchmark(get.dist.seq(1000),get.dist.parallel(1000),times=1)
# Unit: seconds
#                     expr       min        lq      mean    median        uq       max neval
#       get.dist.seq(1000) 140.19895 140.19895 140.19895 140.19895 140.19895 140.19895     1
#  get.dist.parallel(1000)  50.71218  50.71218  50.71218  50.71218  50.71218  50.71218     1

Using 4 cores improves processing speed by about a factor of 3. So, since 1000 distances take about a minute in parallel, all ~200,000 should take a little under 3 hours.
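To actually kick off the full job, something like this (sketch, not run; remember to release the workers when you're done):

## not run
## estimated run time ~ 3 hours on 4 cores
result <- get.dist.parallel(length(point.moll))   # distances in meters
stopCluster(cl)                                   # shut down the worker processes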

Note that using times=1 is really an abuse of microbenchmark(...), since the whole point is to run the process multiple times and average the results, but I just didn't have the patience.

answered by jlhoward