I have a large dataset with three variables (State, Zipcode, Name). Here is a small extract:
zz <- "State Zipcode Name
IL 60693 THISISTHEFIRST
IL 60693 TISISTHEFIRS
OH 45271 THISISTHEFIRST
CA 94085 THISISTHESECOND
CA 94085 THISISTHESECOND
CA 94085 THISISTHESECCOND
SC 29645 THISISTHETHIRD
SC 29645 THISISTHETHIRD
SC 29645 THISISTHETHIRD
SC 29645 THISISTHEFOURTH
SC 29645 ISISTHEFOURTH"
Data <- read.table(text=zz, header = TRUE)
I need to create a unique ID for observations that share the same State, Zipcode, and Name. However, some of the names are misspelled even though they refer to the same subject (e.g. THISISTHEFIRST vs. TISISTHEFIRS).
I would like to end up with something that looks like this:
State Zipcode Name ID
IL 60693 THISISTHEFIRST 1
IL 60693 TISISTHEFIRS 1
OH 45271 THISISTHEFIRST 2
CA 94085 THISISTHESECOND 3
CA 94085 THISISTHESECOND 3
CA 94085 THISISTHESECCOND 3
SC 29645 THISISTHETHIRD 4
SC 29645 THISISTHETHIRD 4
SC 29645 THISISTHETHIRD 4
SC 29645 THISISTHEFOURTH 5
SC 29645 ISISTHEFOURTH 5
How can I create this unique ID quickly and reliably?
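You could do something like this with agrepl, which does fuzzy (approximate) matching. You can play with the maximum edit distance. Note that this groups on Name only: State and Zipcode are not taken into account, so row 3 (OH) ends up in the same group as rows 1 and 2 (IL).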
Data$bins <- sapply(Data$Name, function(n)
paste(as.integer(agrepl(n, Data$Name, max.distance = 2)), collapse=""))
Data$Group <- as.integer(as.factor(Data$bins))
# State Zipcode Name bins Group
# 1 IL 60693 THISISTHEFIRST 11100000000 4
# 2 IL 60693 TISISTHEFIRS 11100000000 4
# 3 OH 45271 THISISTHEFIRST 11100000000 4
# 4 CA 94085 THISISTHESECOND 00011100000 3
# 5 CA 94085 THISISTHESECOND 00011100000 3
# 6 CA 94085 THISISTHESECCOND 00011100000 3
# 7 SC 29645 THISISTHETHIRD 00000011100 2
# 8 SC 29645 THISISTHETHIRD 00000011100 2
# 9 SC 29645 THISISTHETHIRD 00000011100 2
# 10 SC 29645 THISISTHEFOURTH 00000000011 1
# 11 SC 29645 ISISTHEFOURTH 00000000011 1
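The grouping above is driven by Name alone. One possible way to fold State and Zipcode back in is to paste them together with the bins string into a composite key and number the distinct keys. This is only a sketch on the sample data; the names key and ID are illustrative and not part of the answer above.

```r
zz <- "State Zipcode Name
IL 60693 THISISTHEFIRST
IL 60693 TISISTHEFIRS
OH 45271 THISISTHEFIRST
CA 94085 THISISTHESECOND
CA 94085 THISISTHESECOND
CA 94085 THISISTHESECCOND
SC 29645 THISISTHETHIRD
SC 29645 THISISTHETHIRD
SC 29645 THISISTHETHIRD
SC 29645 THISISTHEFOURTH
SC 29645 ISISTHEFOURTH"
Data <- read.table(text = zz, header = TRUE, stringsAsFactors = FALSE)

# Same fuzzy match pattern as above: one 0/1 string per row
Data$bins <- sapply(Data$Name, function(n)
  paste(as.integer(agrepl(n, Data$Name, max.distance = 2)), collapse = ""))

# Composite key (illustrative): rows must agree on State, Zipcode and match pattern
key <- paste(Data$State, Data$Zipcode, Data$bins, sep = "_")

# Number the distinct keys in order of first appearance
Data$ID <- match(key, unique(key))

Data[, c("State", "Zipcode", "Name", "ID")]
```

Using match(key, unique(key)) rather than as.integer(as.factor(key)) keeps the IDs in row order instead of the alphabetical order of the keys.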
This will get you to your solution in a similar way:
library(splitstackshape)  # provides getanID()
# group() is defined further down
Data$Group <- group(Data[,'Name'])
Data$ID <- getanID(Data, c('State', 'Zipcode', 'Group'))[,'.id', with=FALSE]
Data[,!names(Data) %in% 'Group']
# State Zipcode Name .id
# 1 IL 60693 THISISTHEFIRST 1
# 2 IL 60693 TISISTHEFIRS 2
# 3 OH 45271 THISISTHEFIRST 1
# 4 CA 94085 THISISTHESECOND 1
# 5 CA 94085 THISISTHESECOND 2
# 6 CA 94085 THISISTHESECCOND 3
# 7 SC 29645 THISISTHETHIRD 1
# 8 SC 29645 THISISTHETHIRD 2
# 9 SC 29645 THISISTHETHIRD 3
# 10 SC 29645 THISISTHEFOURTH 1
# 11 SC 29645 ISISTHEFOURTH 2
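It uses a function called group, which is built along the same lines as LegalizeIt's approach, and the getanID function from the splitstackshape package. Note that, as the output above shows, getanID numbers the rows within each State/Zipcode/Group combination rather than assigning one ID per combination.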
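group <- function(vec, maxdist = 2) {
  # TRUE where the edit distance between two names is at most maxdist
  dist <- sapply(vec, adist, vec) <= maxdist
  # encode each row's match pattern as a string of 0s and 1s
  nums <- apply(as.matrix(dist), 1, function(x) paste(as.integer(x), collapse=''))
  # rows with identical match patterns get the same group number
  as.integer(factor(nums))
}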
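Two things may matter on a large dataset. First, group compares every name with every other name, which grows quadratically with the number of rows. Second, two rows only share a group if their whole match patterns are identical, so a chain of near-matches (A close to B, B close to C, but A not close to C) can be split. A possible alternative, sketched below, compares names only within each State/Zipcode block and uses single-linkage hierarchical clustering on the adist matrix instead of the 0/1 strings. This swaps in a different technique from the answers above; cluster_names, key and ID are hypothetical names used for illustration.

```r
zz <- "State Zipcode Name
IL 60693 THISISTHEFIRST
IL 60693 TISISTHEFIRS
OH 45271 THISISTHEFIRST
CA 94085 THISISTHESECOND
CA 94085 THISISTHESECOND
CA 94085 THISISTHESECCOND
SC 29645 THISISTHETHIRD
SC 29645 THISISTHETHIRD
SC 29645 THISISTHETHIRD
SC 29645 THISISTHEFOURTH
SC 29645 ISISTHEFOURTH"
Data <- read.table(text = zz, header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: cluster names whose edit distance chains stay within maxdist
cluster_names <- function(vec, maxdist = 2) {
  vec <- as.character(vec)
  # hclust() needs at least two observations
  if (length(vec) < 2) return(rep(1L, length(vec)))
  d <- as.dist(adist(vec))
  # single linkage: merge clusters while the closest pair is within maxdist
  unname(cutree(hclust(d, method = "single"), h = maxdist))
}

# Cluster names separately inside each State/Zipcode block
Data$Group <- ave(seq_len(nrow(Data)), Data$State, Data$Zipcode,
                  FUN = function(i) cluster_names(Data$Name[i]))

# One ID per distinct State/Zipcode/Group combination, in order of appearance
key <- paste(Data$State, Data$Zipcode, Data$Group, sep = "_")
Data$ID <- match(key, unique(key))

Data[, c("State", "Zipcode", "Name", "ID")]
```

Blocking by State and Zipcode keeps each distance matrix small, at the cost of never matching a name across blocks, which fits the requirement that the ID is specific to State, Zipcode and Name.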