I want to create an unique ID in R based on two columns of latitude and longitude so that duplicated locations have the same cluster ID.
For example:
LAT LONG Cluster_ID 13.5330 -15.4180 1 13.5330 -15.4180 1 13.5330 -15.4180 1 13.5330 -15.4180 1 13.5330 -15.4170 2 13.5330 -15.4170 2 13.5330 -15.4170 2 13.5340 -14.9350 3 13.5340 -14.9350 3 13.5340 -15.9170 4 13.3670 -14.6190 5
GROUP BY only treats two rows as duplicates if all the column values in both the rows are the same. If even a single column value in either of the row is non-matching, they are treated as unique.
One way to find duplicate records from the table is the GROUP BY statement. The GROUP BY statement in SQL is used to arrange identical data into groups with the help of some functions. i.e if a particular column has the same values in different rows then it will arrange these rows in a group.
To select duplicate values, you need to create groups of rows with the same values and then select the groups with counts greater than one. You can achieve that by using GROUP BY and a HAVING clause.
Here's one way using interaction
.
d <- read.table(text='LAT LONG 13.5330 -15.4180 13.5330 -15.4180 13.5330 -15.4180 13.5330 -15.4180 13.5330 -15.4170 13.5330 -15.4170 13.5330 -15.4170 13.5340 -14.9350 13.5340 -14.9350 13.5340 -15.9170 13.3670 -14.6190', header=TRUE) d <- transform(d, Cluster_ID = as.numeric(interaction(LAT, LONG, drop=TRUE))) # LAT LONG Cluster_ID # 1 13.533 -15.418 2 # 2 13.533 -15.418 2 # 3 13.533 -15.418 2 # 4 13.533 -15.418 2 # 5 13.533 -15.417 3 # 6 13.533 -15.417 3 # 7 13.533 -15.417 3 # 8 13.534 -14.935 4 # 9 13.534 -14.935 4 # 10 13.534 -15.917 1 # 11 13.367 -14.619 5
EDIT: Incorporated @Spacedman's suggestion to supply drop=TRUE
to interaction
.
The data:
dat <- read.table(text=" LAT LONG 13.5330 -15.4180 13.5330 -15.4180 13.5330 -15.4180 13.5330 -15.4180 13.5330 -15.4170 13.5330 -15.4170 13.5330 -15.4170 13.5340 -14.9350 13.5340 -14.9350 13.5340 -15.9170 13.3670 -14.6190", header = TRUE)
These commands create an id variable starting with 1
:
comb <- with(dat, paste(LAT, LONG)) within(dat, Cluster_ID <- match(comb, unique(comb)))
The output:
LAT LONG Cluster_ID 1 13.533 -15.418 1 2 13.533 -15.418 1 3 13.533 -15.418 1 4 13.533 -15.418 1 5 13.533 -15.417 2 6 13.533 -15.417 2 7 13.533 -15.417 2 8 13.534 -14.935 3 9 13.534 -14.935 3 10 13.534 -15.917 4 11 13.367 -14.619 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With