I am using a "messy" set of data where there was no control on variable input during the data entry process. I need to have uniformity in my categories in order to proceed with my analysis, and I dread having to manually clean the data. An example set looks like this:
Name <- c("Goat", "goat", "BillyGoat", " Billy Goat", "Billy.Goat", "Bilygoat", "Billy-Goat", "Goat", "Billy/Goat", "Billy*Goat",
          "Dog", "DOG", "Dogs", " Dogs", " Dogs", "Dogs ", "DVD", "D.V.D",
          "XYZ", "XZY", "Champlain", "Chaplain", "LakeChamplain", "Lake Champlain")
Number <- seq(1, 24)
DF <- data.frame(Name, Number)
I have capitalization issues, extra spaces, inconsistent use of special characters (periods, hyphens, etc.), and some obvious spelling errors.
It's pretty easy to take care of the first two issues by making everything lower case and removing all spaces:
DF$Name <- tolower(DF$Name)
DF$Name <- gsub(" ", "", DF$Name)
But given my actual dataset is pretty massive, I'd like to avoid manually cleaning up spelling and other issues with my data. Given this is a common problem in data science, are there any R resources I can use to clean up this kind of messy data?
Note that clean_names() from the janitor package standardizes column names, not the values inside a column, so it won't fix these categories on its own. For the values, extend the normalization you already have to strip punctuation as well as spaces:

DF$Name <- gsub("[[:punct:] ]", "", tolower(DF$Name))

That collapses the case, spacing, and special-character variants. The remaining spelling variants ("bilygoat", "xzy") need approximate string matching, e.g. base R's adist()/agrep() or the stringdist package.
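For the spelling variants, here is a base-R sketch of one way to group near-duplicate values: normalize, compute pairwise edit distances with adist(), cluster them hierarchically, and pick the most common spelling in each cluster as the canonical label. The cutoff h = 2 is an assumption you would tune for your data, and this approach will happily merge genuinely different words of similar spelling (e.g. "champlain"/"chaplain"), so inspect the resulting groups before trusting them.

```r
Name <- c("Goat", "goat", "BillyGoat", " Billy Goat", "Billy.Goat",
          "Bilygoat", "Billy-Goat", "Goat", "Billy/Goat", "Billy*Goat",
          "Dog", "DOG", "Dogs", " Dogs", " Dogs", "Dogs ", "DVD", "D.V.D",
          "XYZ", "XZY", "Champlain", "Chaplain", "LakeChamplain",
          "Lake Champlain")

# Normalize: lower case, drop everything that isn't a letter or digit
clean <- gsub("[^a-z0-9]", "", tolower(Name))

# Pairwise generalized Levenshtein distances (base R, utils::adist)
d <- adist(clean)
rownames(d) <- clean

# Cluster the strings and cut the tree so that near-identical
# spellings (edit distance <= 2, an assumed threshold) share a group
cl <- hclust(as.dist(d))
groups <- cutree(cl, h = 2)

# Canonical label per group: the most frequent spelling in that group
canonical <- ave(clean, groups, FUN = function(x) names(which.max(table(x))))
```

You would then assign `DF$Name <- canonical` (after applying the same normalization) and review `table(Name, canonical)` to catch bad merges.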