I am using a "messy" set of data where there was no control on variable input during the data entry process. I need to have uniformity in my categories in order to proceed with my analysis, and I dread having to manually clean the data. An example set looks like this:
Name <- c("Goat", "goat", "BillyGoat", " Billy Goat", "Billy.Goat", "Bilygoat", "Billy-Goat", "Goat", "Billy/Goat", "Billy*Goat",
          "Dog", "DOG", "Dogs", " Dogs", " Dogs", "Dogs ", "DVD", "D.V.D",
          "XYZ", "XZY", "Champlain", "Chaplain", "LakeChamplain", "Lake Champlain")
Number <- seq(1, 24)
DF <- data.frame(Name, Number)
I have capitalization issues, extra spaces, inconsistent use of special characters (periods, hyphens, etc.), and some obvious spelling errors.
It's pretty easy to take care of the first two issues by making everything lower case and removing all spaces:
DF$Name <- tolower(DF$Name)
DF$Name <- gsub(" ", "", DF$Name)
But given my actual dataset is pretty massive, I'd like to avoid manually cleaning up spelling and other issues with my data. Given this is a common problem in data science, are there any R resources I can use to clean up this kind of messy data?
Note that clean_names() from the janitor package standardizes column names, not the values inside a column, so it won't fix these categories on its own. For the values, extend the normalization you already have to strip punctuation as well as spaces:

DF$Name <- gsub("[[:punct:] ]", "", tolower(DF$Name))

That collapses the case, spacing, and special-character variants. The remaining spelling variants ("bilygoat", "xzy") need approximate string matching, e.g. base R's adist()/agrep() or the stringdist package.
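For the spelling variants, here is a base-R sketch of one way to group near-duplicate values: normalize, compute pairwise edit distances with adist(), cluster them hierarchically, and pick the most common spelling in each cluster as the canonical label. The cutoff h = 2 is an assumption you would tune for your data, and this approach will happily merge genuinely different words of similar spelling (e.g. "champlain"/"chaplain"), so inspect the resulting groups before trusting them.

```r
Name <- c("Goat", "goat", "BillyGoat", " Billy Goat", "Billy.Goat",
          "Bilygoat", "Billy-Goat", "Goat", "Billy/Goat", "Billy*Goat",
          "Dog", "DOG", "Dogs", " Dogs", " Dogs", "Dogs ", "DVD", "D.V.D",
          "XYZ", "XZY", "Champlain", "Chaplain", "LakeChamplain",
          "Lake Champlain")

# Normalize: lower case, drop everything that isn't a letter or digit
clean <- gsub("[^a-z0-9]", "", tolower(Name))

# Pairwise generalized Levenshtein distances (base R, utils::adist)
d <- adist(clean)
rownames(d) <- clean

# Cluster the strings and cut the tree so that near-identical
# spellings (edit distance <= 2, an assumed threshold) share a group
cl <- hclust(as.dist(d))
groups <- cutree(cl, h = 2)

# Canonical label per group: the most frequent spelling in that group
canonical <- ave(clean, groups, FUN = function(x) names(which.max(table(x))))
```

You would then assign `DF$Name <- canonical` (after applying the same normalization) and review `table(Name, canonical)` to catch bad merges.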