I am working with Uranium-Lead geochronological datasets. This type of data is typically interpreted in what is called a "Wetherill" concordia diagram, an x-y coordinate space where each axis contains Pb/U ratios that correspond to one age estimate. Given that both axes have independent age estimates, but the isotopic systems decay at different rates, the line of equal age (concordia) in this space has curvature to it.
Geochronologists interpret data arrays in this space; the relevant age information is contained in where those arrays intersect the concordia line.
Typical interpretations rely on single data arrays following a single line, which is readily regressed to find intersections with the concordia line, and uncertainty is robustly propagated on those regressions.
However, more complex datasets can be difficult to interpret.
The fundamental information in more complex scenarios can be represented by a polygon, where the corners of the polygon must fall on the concordia curve - these are the points that contain the age information. In essence, we can interpret the boundary lines in a dense data array as containing the geochronological information.
I am dealing with very large n datasets (>1 million data points), and I am working on a method to identify the polygon shape that contains the largest fraction of the data, but also tightly defines the outer limit of the data - in other words the optimal polygon would not be simply the largest polygon.
Below I generate some synthetic data that corresponds to a complex data array. The scenario modeled contains a large population of mineral grains that grew between 3 billion years ago (upper.int.age.max) and 2.5 billion years ago (upper.int.age.min). These grains then experienced an alteration event 1 billion years ago (lower.int.age.1), and another alteration event 100 million years ago (lower.int.age.2).
## set seed
set.seed( 12 )
## now create ages for a model dataset events
lower.int.age.1 <- 1000
lower.int.age.2 <- 100
upper.int.age.min <- 2500
upper.int.age.max <- 3000
## a couple of required equations
ratio735 <- function(age) { ## age in Ma (the 1e6 factor converts to years)
exp( 9.8485e-10 * age * 1e6 ) - 1
}
ratio638 <- function(age) { ## age in Ma (the 1e6 factor converts to years)
exp( 1.55125e-10 * age * 1e6 ) - 1
}
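## (note: these are the standard radiogenic ingrowth equations,
##  207Pb*/235U = exp(lambda235 * t) - 1 and 206Pb*/238U = exp(lambda238 * t) - 1,
##  with lambda235 = 9.8485e-10 /yr and lambda238 = 1.55125e-10 /yr;
##  the 1e6 factor converts an age supplied in Ma to years)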
## calculate x and y values from the age information
x.values <- c( ratio735(lower.int.age.1),
ratio735(lower.int.age.2) ) # calculate 7/35 ratios
y.values <- c( ratio638(lower.int.age.1),
ratio638(lower.int.age.2) ) # calculate 6/38 ratios
## now make a data frame of example data
data.test <- data.frame( upper.age = runif( 5000, min = upper.int.age.min,
                                            max = upper.int.age.max ),
                         prop.1 = runif( 5000, min = 0.05, max = 1 ),
                         prop.2 = runif( 5000, min = 0.05, max = 1 ) )
data.test$x1 <- ( ratio735( data.test$upper.age ) - x.values[1] ) * data.test$prop.1 + x.values[1]
data.test$y1 <- ( ratio638( data.test$upper.age ) - y.values[1] ) * data.test$prop.1 + y.values[1]
data.test$x2 <- ( data.test$x1 - x.values[2] ) * data.test$prop.2 + x.values[2]
data.test$y2 <- ( data.test$y1 - y.values[2] ) * data.test$prop.2 + y.values[2]
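As a sanity check on the synthetic data (illustrative only, using a throwaway true.corners data frame), nearly all of the points should fall inside the polygon whose corners sit on the concordia curve at the four model ages:
## check the fraction of synthetic points inside the "true" corner polygon
true.corners <- data.frame( age = c( lower.int.age.2, lower.int.age.1,
                                     upper.int.age.min, upper.int.age.max ) )
true.corners$x <- ratio735( true.corners$age )
true.corners$y <- ratio638( true.corners$age )
inside <- sp::point.in.polygon( point.x = data.test$x2, point.y = data.test$y2,
                                pol.x = true.corners$x, pol.y = true.corners$y )
mean( inside >= 1 )  ## near 1; a few points sit just above the chord between
                     ## the two upper corners, where the concordia curve bulges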
Now here are the equations for the line of equal age (the concordia line), along with some points generated along the line for plotting purposes.
## line equation
concordia.x <- function(t) {
x <- exp(9.8485e-10 * t * 1e6) - 1
x
}
concordia.y <- function(t) {
y <- exp(1.55125e-10 * t * 1e6) - 1
y
}
concordia.line <- data.frame( age = seq(0,3000,1) )
concordia.line$x <- ratio735( concordia.line$age )
concordia.line$y <- ratio638( concordia.line$age )
concordia.points <- subset( concordia.line, age == lower.int.age.1 |
                            age == lower.int.age.2 | age == upper.int.age.min |
                            age == upper.int.age.max )
Now plot the synthetic data and the concordia line. The black points are the geological events.
## plot the data and line
library(ggplot2)
ggplot( data.test, aes( x = x2, y = y2 ) ) +
geom_line( data = concordia.line, aes( x = x, y = y ) ) +
geom_point( color = "red" ) +
geom_point( data = concordia.points, aes( x = x, y = y ),
size = 4 )
Here is the resulting plot (the bold black points are the lower.int.age.1 etc. ages on the concordia line). Ideally a model will return a polygon with the four bold points making up the corners of the polygon.
I'm currently doing this brute force, by generating a suite of hypothetical polygons with corners defined along the concordia line. I generate the polygons by calculating four sequences of possible corner values, then using expand.grid to find all possible combinations. Then I filter that grid for the correct sequence of points.
## create polygons
lower.1 = seq( from = 0, to = 100, by = 50 ) ## corner 1
lower.2 = seq( from = 900, to = 1100, by = 100 ) ## corner 2
upper.1 = seq( from = 2400, to = 2800, by = 100 ) ## corner 3
upper.2 = seq( from = 2800, to = 3200, by = 100 ) ## corner 4
# find all combinations
age.values = expand.grid( lower.1, lower.2, upper.1, upper.2 )
colnames( age.values ) <- c( "lower.1", "lower.2", "upper.1",
"upper.2" )
##filter for polygons with a certain order to them
age.values.used <- subset( age.values, lower.1 < lower.2 & lower.2 < upper.1 & upper.1 < upper.2 )
## calculate a list of x,y values for each polygon
polygons <- list(NA)
## Fill the list with a for loop - not a great way, I know
for( i in 1:nrow( age.values.used ) ) {
data.temp <- age.values.used[ i, ]
polygons[[ i ]] <- data.frame( age = age.values.used[i,],
x = concordia.x( as.vector( t( data.temp ) ) ),
y = concordia.y( as.vector( t( data.temp ) ) ) )
}
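As an aside, the same list can be built a bit more compactly with lapply (an illustrative equivalent of the loop above, producing the same polygons structure):
## illustrative alternative: build the same list with lapply instead of the loop
polygons <- lapply( seq_len( nrow( age.values.used ) ), function( i ) {
  corner.ages <- unlist( age.values.used[ i, ], use.names = FALSE )
  data.frame( age = age.values.used[ i, ],
              x = concordia.x( corner.ages ),
              y = concordia.y( corner.ages ) )
} )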
Once I have the polygons in a list, where each element contains the four corners of the polygon, I make a function to find the fraction of data in each polygon. This uses the sp::point.in.polygon function and just calculates the fraction of data in each polygon in the list of polygons.
library(sp)
library(pracma)
## calculate the fraction of data in one polygon
in.poly <- function( poly.frame, points ) {
  ## uses two data frames: one with the data, one with the x and y coords for the polygon
## use sp point.in.polygon function
in.polygon <- sp::point.in.polygon( point.x = points[,1],
point.y = points[,2],
pol.x = poly.frame$x,
pol.y = poly.frame$y )
sum( in.polygon == 1 ) / length( in.polygon )
}
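For example, calling it on the first candidate polygon returns a single fraction between 0 and 1 (the exact value depends on the seed and the polygon grid above):
## e.g. fraction of the synthetic data falling inside the first candidate polygon
in.poly( polygons[[ 1 ]], data.test[ , c( "x2", "y2" ) ] )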
Now I loop through using that function to find the fraction in each polygon.
## create a blank data frame to fill in a for loop
results.data <- data.frame( upper.1 = rep( NA, times = length( polygons ) ),
                            upper.2 = rep( NA, times = length( polygons ) ),
                            lower.2 = rep( NA, times = length( polygons ) ),
                            lower.1 = rep( NA, times = length( polygons ) ),
                            fraction.data.in = rep( NA, times = length( polygons ) ),
                            area = rep( NA, times = length( polygons ) ) )
Now use the for loop to add the corner ages and the fraction of data in the polygon, and to calculate the area of the polygon (using the sp::Polygon function).
for( j in 1:length( polygons ) ) {
  results.data$upper.1[ j ] <- polygons[[j]]$age.upper.1[1]
  results.data$upper.2[ j ] <- polygons[[j]]$age.upper.2[1]
  results.data$lower.2[ j ] <- polygons[[j]]$age.lower.2[1]
  results.data$lower.1[ j ] <- polygons[[j]]$age.lower.1[1]
  results.data$fraction.data.in[ j ] <- in.poly( polygons[[j]],
                                                 data.test[ , c("x2", "y2") ] )
  poly.def <- sp::Polygon( polygons[[j]][ , c("x","y") ] )
  results.data$area[ j ] <- slot( poly.def, "area" )
}
I can then use this data to figure out which polygons have the most data inside of them while also having the least area.
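For illustration only, one crude way to combine the two criteria into a single ranking is a score such as coverage squared divided by area; the squaring is just an assumed weighting, not anything principled:
## rank candidate polygons by an (arbitrary) coverage-versus-area score
results.data$score <- results.data$fraction.data.in^2 / results.data$area
head( results.data[ order( results.data$score, decreasing = TRUE ), ] )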
This is very much a brute force method and I'm hoping to find something that is more elegant (not hard I'm sure) and runs significantly faster.
Thanks for any help you can provide, and please comment if I should explain something further.
I suggest a genetic algorithm using the GA package.
The goal is to find a set of times within the interval
age.min <- 0
age.max <- 3000
such that the (x,y) coordinates
compute_x_from_age <- function(t){
exp(9.8485e-10 * t * 1e6) - 1
}
compute_y_from_age <- function(t){
exp(1.55125e-10 * t * 1e6) - 1
}
enclose most of the
red_points <- data.test[, c("x2", "y2")]
while having minimal area. Call a vector of four numbers in the interval a candidate. Then
library(GA)
library(sp)
library(pracma)
library(ggplot2)
fraction_in_candidate <- function(candidate){
candidate <- sort(candidate)
in.polygon <- sp::point.in.polygon( point.x = red_points[,1],
point.y = red_points[,2],
pol.x = compute_x_from_age(candidate),
pol.y = compute_y_from_age(candidate))
sum( in.polygon == 1 ) / length( in.polygon )
}
area_of_candidate <- function(candidate){
candidate <- sort(candidate)
sp::Polygon(data.frame(compute_x_from_age(candidate),
compute_y_from_age(candidate))) |>
slot("area")
}
fitness_of_candidate <- function(candidate){
(fraction_in_candidate(candidate)^5)/sqrt(area_of_candidate(candidate))
}
You should use theoretical and empirical domain knowledge to adjust the fitness function so that it gives reasonable output; my first few tries gave degenerate (extremely narrow) polygons.
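For example (purely a sketch, with lambda an assumed tuning constant that depends on the x-y scale of the data), a penalised form that cannot blow up as the area shrinks toward zero may be less prone to those degenerate solutions:
## illustrative variant: subtract an area penalty instead of dividing by area,
## so a vanishingly thin polygon gains nothing from its tiny area
penalised_fitness <- function(candidate, lambda = 0.05){
  fraction_in_candidate(candidate) - lambda * area_of_candidate(candidate)
}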
Given a candidate, we can mutate it by randomly nudging a point left or right.
custom_mutation <- function(object, parent) {
child <- object@population[parent, ]
gene_to_mutate <- sample(1:4, 1)
jitter_amount <- rnorm(1, mean = 0, sd = 100)
child[gene_to_mutate] <- child[gene_to_mutate] + jitter_amount
# clip to boundary
child[gene_to_mutate] <- min(max(child[gene_to_mutate], age.min), age.max)
child
}
Given two parent candidates, we can merge the eight values together and then separate them randomly into two child candidates.
custom_crossover <- function(object, parents) {
parent1 <- object@population[parents[1], ]
parent2 <- object@population[parents[2], ]
combined_genes <- c(parent1, parent2)
children <- sample(combined_genes,8)
list(children = matrix(children, nrow = 2), fitness = c(NA, NA))
}
Then the genetic algorithm can be run with
ga_result <- ga(
type = "real-valued",
fitness = fitness_of_candidate,
lower = rep(age.min, 4),
upper = rep(age.max, 4),
popSize = 500,
maxiter = 50,
mutation = custom_mutation,
crossover = custom_crossover,
elitism = 5,
monitor = TRUE
)
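## the GA package also provides summary() and plot() methods for the fitted
## object, which are useful for checking convergence before trusting the result
summary(ga_result)   # best fitness value and corresponding solution
plot(ga_result)      # best and mean fitness across generations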
# Best solution
solution <- sort(ga_result@solution[1,])
solution_df <- data.frame(x2 = compute_x_from_age(solution),
y2 = compute_y_from_age(solution))
ggplot(red_points, aes(x=x2, y = y2)) +
geom_line( data = concordia.line, aes( x = x, y = y ) ) +
geom_point(data = solution_df, color = "blue", size = 4) +
geom_polygon(data = solution_df,
color = "blue",
alpha = 0.3
) +
geom_point( color = "red" )
My uneducated guess is that the main place to focus your geochronological knowledge is in the fitness function. I believe the rest is simply about the speed of convergence.
You could try letting the mutation function occasionally add or remove a point. Then you might get non-quadrilaterals.
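One way to approximate that within the fixed-length chromosome that the real-valued GA uses (a sketch of an alternative encoding, with the tolerance an assumed value): give each candidate more genes than corners you expect, say six, and merge corners that land within some tolerance of each other, so the effective polygon can have three to six corners.
## sketch: fixed-length encoding with corner merging (tolerance in Ma)
corners_of_candidate <- function(candidate, tol = 25){
  ages <- sort(candidate)
  ages[ c(TRUE, diff(ages) > tol) ]   # drop corners too close to the previous one
}
fraction_in_candidate_k <- function(candidate){
  ages <- corners_of_candidate(candidate)
  if (length(ages) < 3) return(0)     # fewer than three corners is not a polygon
  inside <- sp::point.in.polygon(point.x = red_points[,1],
                                 point.y = red_points[,2],
                                 pol.x = compute_x_from_age(ages),
                                 pol.y = compute_y_from_age(ages))
  mean(inside == 1)
}
## then run ga() as above, but with lower = rep(age.min, 6), upper = rep(age.max, 6)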
You could try fitting a segmented regression, e.g. to a cummax of the data, and predict the y values at the estimated break points.
> xord <- order(data.test$x2)
> yc_max <- cummax(data.test$y2[xord])
> x2_ord <- data.test$x2[xord]
> fit <- lm(yc_max ~ x2_ord)
> o <- segmented::segmented(fit, seg.Z = ~x2_ord, psi = c(2, 11)) ## c(2, 11) are guessed starting values for the optimizer
> x_brk <- unname(summary(o)$psi[, "Est."]) |>
+ c(range(data.test$x2)) |> sort()
> yhat <- predict(o, newdata = data.frame(x2_ord = x_brk))
> (polygon_est <- cbind(x=x_brk, y=yhat))
x y
1 0.2181886 0.02226673
2 2.2488102 0.18087809
3 10.5529546 0.45421275
4 17.4122930 0.58586390
>
> with(data.test, plot(x2, y2, pch = '.'))
> polygon(polygon_est, border = 'red')
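Since the estimated break points come back as x (207Pb/235U) values, you can convert them to ages by algebraically inverting the concordia.x relation, for example with a small helper like this:
## invert x = exp(lambda235 * t_yr) - 1  =>  t in Ma
age_from_x <- function(x) log(x + 1) / (9.8485e-10 * 1e6)
age_from_x(x_brk)   # estimated corner ages in Ma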