How to create a unique ID disambiguating strings in R?

Question

I have a large dataset with three variables (State, Zipcode, Name). Here is a small extract:

zz <- "State Zipcode Name
IL  60693 THISISTHEFIRST  
IL 60693 TISISTHEFIRS    
OH  45271 THISISTHEFIRST  
CA 94085 THISISTHESECOND  
CA 94085 THISISTHESECOND  
CA 94085 THISISTHESECCOND 
SC 29645 THISISTHETHIRD  
SC 29645 THISISTHETHIRD  
SC 29645 THISISTHETHIRD  
SC 29645 THISISTHEFOURTH  
SC 29645 ISISTHEFOURTH"

Data <- read.table(text=zz, header = TRUE)

I need to create a unique ID for observations that share the same State, Zipcode, and Name. However, some of the names are misspelled even though they refer to the same subject (e.g. THISISTHEFIRST vs. TISISTHEFIRS).

I would like to end up with something that looks like this:

State Zipcode Name ID
IL 60693 THISISTHEFIRST 1
IL 60693 TISISTHEFIRS 1
OH 45271 THISISTHEFIRST 2
CA 94085 THISISTHESECOND 3
CA 94085 THISISTHESECOND 3
CA 94085 THISISTHESECCOND 3
SC 29645 THISISTHETHIRD 4
SC 29645 THISISTHETHIRD 4
SC 29645 THISISTHETHIRD 4
SC 29645 THISISTHEFOURTH 5
SC 29645 ISISTHEFOURTH 5

How can I create the unique ID quickly and reliably?

Rorschach · Accepted Answer

You could do something like this with agrepl, which does approximate (fuzzy) matching. You can tune the allowed edit distance through max.distance.

Data$bins <- sapply(Data$Name, function(n)
    paste(as.integer(agrepl(n, Data$Name, max.distance = 2)), collapse=""))
Data$Group <- as.integer(as.factor(Data$bins))

#    State Zipcode             Name        bins Group
# 1     IL   60693   THISISTHEFIRST 11100000000     4
# 2     IL   60693     TISISTHEFIRS 11100000000     4
# 3     OH   45271   THISISTHEFIRST 11100000000     4
# 4     CA   94085  THISISTHESECOND 00011100000     3
# 5     CA   94085  THISISTHESECOND 00011100000     3
# 6     CA   94085 THISISTHESECCOND 00011100000     3
# 7     SC   29645   THISISTHETHIRD 00000011100     2
# 8     SC   29645   THISISTHETHIRD 00000011100     2
# 9     SC   29645   THISISTHETHIRD 00000011100     2
# 10    SC   29645  THISISTHEFOURTH 00000000011     1
# 11    SC   29645    ISISTHEFOURTH 00000000011     1
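
Note that Group above is derived from Name alone, so row 3 (OH 45271) ends up in the same group as the two IL rows. A minimal sketch of one way to get closer to the ID asked for in the question: paste State, Zipcode and bins into a key and number the distinct keys in order of first appearance. The names key and ID are introduced here for illustration and are not part of the answer.

```r
# Sample data from the question
Data <- data.frame(
  State   = c("IL", "IL", "OH", "CA", "CA", "CA", "SC", "SC", "SC", "SC", "SC"),
  Zipcode = c(60693, 60693, 45271, 94085, 94085, 94085,
              29645, 29645, 29645, 29645, 29645),
  Name    = c("THISISTHEFIRST", "TISISTHEFIRS", "THISISTHEFIRST",
              "THISISTHESECOND", "THISISTHESECOND", "THISISTHESECCOND",
              "THISISTHETHIRD", "THISISTHETHIRD", "THISISTHETHIRD",
              "THISISTHEFOURTH", "ISISTHEFOURTH"),
  stringsAsFactors = FALSE
)

# Fuzzy-match indicator string per row, as in the answer above
Data$bins <- sapply(Data$Name, function(n)
  paste(as.integer(agrepl(n, Data$Name, max.distance = 2)), collapse = ""))

# Assumption: rows belong together only if State, Zipcode and bins all agree
key <- paste(Data$State, Data$Zipcode, Data$bins, sep = "_")

# Number the distinct keys in order of first appearance
Data$ID <- match(key, unique(key))
Data[, c("State", "Zipcode", "Name", "ID")]
```

If bins come out as printed above, the IDs are 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5, which matches the layout requested in the question.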

Pierre L · Answer

This will get you to your solution in a similar way:

library(splitstackshape)  # provides getanID()

# group() is defined below
Data$Group <- group(Data[, 'Name'])
Data$ID <- getanID(Data, c('State', 'Zipcode', 'Group'))[, '.id', with = FALSE]
Data[, !names(Data) %in% 'Group']
#    State Zipcode             Name .id
# 1     IL   60693   THISISTHEFIRST   1
# 2     IL   60693     TISISTHEFIRS   2
# 3     OH   45271   THISISTHEFIRST   1
# 4     CA   94085  THISISTHESECOND   1
# 5     CA   94085  THISISTHESECOND   2
# 6     CA   94085 THISISTHESECCOND   3
# 7     SC   29645   THISISTHETHIRD   1
# 8     SC   29645   THISISTHETHIRD   2
# 9     SC   29645   THISISTHETHIRD   3
# 10    SC   29645  THISISTHEFOURTH   1
# 11    SC   29645    ISISTHEFOURTH   2

It uses a function called group, which is built along the same lines as LegalizeIt's approach, and the function getanID from the splitstackshape package.

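group <- function(vec, maxdist = 2) {
  # Pairwise edit distances between all names, thresholded at maxdist
  dist <- sapply(vec, adist, vec) <= maxdist
  # Encode each row of matches as a 0/1 string
  nums <- apply(as.matrix(dist), 1, function(x) paste(as.integer(x), collapse=''))
  # Identical match patterns get the same integer code
  as.integer(factor(nums))
}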
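
The .id column in the output above restarts within each State/Zipcode/Group combination, so it is not yet one ID per subject. As a swapped-in alternative (not taken from either answer), the sketch below clusters the names inside each State/Zipcode pair with single-linkage hierarchical clustering on adist() distances, cuts the tree at the maximum edit distance, and then numbers the resulting clusters. The helper fuzzy_group and the names within, key and ID are hypothetical. Unlike the 0/1 indicator strings, single linkage also chains matches (A close to B, B close to C), which may or may not be what you want.

```r
# Sample data from the question
Data <- data.frame(
  State   = c("IL", "IL", "OH", "CA", "CA", "CA", "SC", "SC", "SC", "SC", "SC"),
  Zipcode = c(60693, 60693, 45271, 94085, 94085, 94085,
              29645, 29645, 29645, 29645, 29645),
  Name    = c("THISISTHEFIRST", "TISISTHEFIRS", "THISISTHEFIRST",
              "THISISTHESECOND", "THISISTHESECOND", "THISISTHESECCOND",
              "THISISTHETHIRD", "THISISTHETHIRD", "THISISTHETHIRD",
              "THISISTHEFOURTH", "ISISTHEFOURTH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cluster strings whose edit distance is <= maxdist
fuzzy_group <- function(x, maxdist = 2) {
  # hclust() needs at least two items; 0 or 1 names form a single cluster
  if (length(x) < 2) return(rep(1L, length(x)))
  tree <- hclust(as.dist(adist(x)), method = "single")
  unname(cutree(tree, h = maxdist))
}

# Cluster the names separately inside each State/Zipcode combination
within <- ave(seq_along(Data$Name), Data$State, Data$Zipcode,
              FUN = function(i) fuzzy_group(Data$Name[i]))

# Number each (State, Zipcode, cluster) key in order of first appearance
key <- paste(Data$State, Data$Zipcode, within)
Data$ID <- match(key, unique(key))
Data[, c("State", "Zipcode", "Name", "ID")]
```

Restricting the comparison to rows that share a State and Zipcode also keeps the distance matrices small, which helps on a large dataset.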

Tags: database, r
Bob
