I am continuing to work on some data cleaning practice with some animal shelter data. My goal here is to shrink down the number of breed categories.
I am using each breed category as a partial pattern match against the outgoing$Single.Breed data frame column. So, there are cases where the breed will just be Chihuahua, but it may also be Long Hair Chihuahua. (Hence, my use of grepl.) Thus, anything containing a breed category would be represented in a different column by said category. Furthermore, I also need to add the cat breed categories...making for an even messier bunch of code.
The code below is my "solution", but it's quite clunky. Is there a better, slicker and/or more efficient way to accomplish this?
BreedCategories <- ifelse(outgoing$New.Type == "Dog",
ifelse(grepl("Chihuahua",outgoing$Single.Breed, ignore.case = TRUE), "Chihuahua",
ifelse(grepl("Pit Bull",outgoing$Single.Breed, ignore.case = TRUE), "Pit Bull",
ifelse(grepl("Terrier",outgoing$Single.Breed, ignore.case = TRUE), "Terrier",
ifelse(grepl("Shepherd",outgoing$Single.Breed, ignore.case = TRUE), "Shepherd",
ifelse(grepl("Poodle",outgoing$Single.Breed, ignore.case = TRUE), "Poodle",
ifelse(grepl("Labrador|Retriever",outgoing$Single.Breed, ignore.case = TRUE),"Labrador",
"Other")))))),"Cat")
Create a data.frame that maps between the regular expression and what the breed is
map <- data.frame(
pattern=c(
"Chihuahua", "Pit Bull", "Terrier", "Shepherd",
"Poodle", "Labrador|Retriever", "Other"),
isa=c(
"Chihuahua", "Pit Bull", "Terrier", "Shepherd",
"Poodle", "Labrador", "Other"),
stringsAsFactors=FALSE)
and some data
outgoing <- data.frame(Single.Breed=c(map$isa, "Pit Bull Poodle", "Pug"),
stringsAsFactors=FALSE)
For the program, use vapply() and grepl() to match each pattern to the data; the use of grepl() means that the result is a matrix, with rows corresponding to each entry
isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
if (any(rowSums(isa) > 1))
warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])
Use max.col() to visit each row and retrieve the best (last) match (which happens to be 'Other' if there are no matches).
outgoing$BreedCategory <- map$isa[max.col(isa, "last")]
Here's the result
> isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
> if (any(rowSums(isa) > 1))
+ warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])
Warning message:
ambiguous breeds: Pit Bull Poodle
> outgoing$BreedCategory <- map$isa[max.col(isa, "last")]
> outgoing
Single.Breed BreedCategory
1 Chihuahua Chihuahua
2 Pit Bull Pit Bull
3 Terrier Terrier
4 Shepherd Shepherd
5 Poodle Poodle
6 Labrador Labrador
7 Other Other
8 Pit Bull Poodle Poodle
9 Pug Other
I guess the approach is appealing because it more clearly separates the 'data' (regex and input breeds) from the 'program' (grepl() and max.col()).
The handling of 'Other' seems a little fragile -- what if you forget that it is supposed to be the last element of map? One possibility is to create an indicator variable that tests the row sums of isa, and uses this to conditionally assign the breed
test = rowSums(isa)
outgoing$BreedCategory[test == 0] = "Other"
outgoing$BreedCategory[test == 1] = map$isa[max.col(isa)][test == 1]
outgoing$BreedCategory[test > 1] = "Mixed"
The above is not very space efficient (the matrix transforms your length n data to an n x # of regex matrix), but seems likely to get the job done for say 1M input rows.
dplyr::case_when() seems to require that you write many grepl() statements, which is error prone.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With