I am reading a CSV file into a data frame called dopers in R.
dopers <- read.csv(file = "generalDoping_alldata2.csv", header = TRUE, sep = ",")
After reading the file, I have to do some data cleanup. For instance, if the Country column says
"United States" or "United State"
I would like to replace it with "USA".
I want my code to work even if the value is "  United States    " or "United   State   ". In other words, even if there are extra characters before or after "United States", the value should be replaced with "USA". I understand that the sub() function can be used for this. I found the following online, but I do not understand what "^", "$", "*" and "." do. Can someone please explain?
dopers$Country = sub("^UNITED STATES.*$", "USA", dopers$Country)
Given your examples,
s <- c(" United States", " United States ", "United States ")
You can define a regular expression pattern that matches them by
pat <- "^.*United State.*$"
Here, ^ represents the beginning and $ the end of the string, while
. stands for any single character and * repeats the preceding element zero or more times. You can experiment with modified patterns, such as
pat <- "^[ ]*United States?[ ]*$" # only ignores surrounding spaces; s? makes the final "s" optional
pat <- "^.*(United State|USA).*$" # also matches "  USA" etc.
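To see how the anchors and quantifiers change what matches, you can compare patterns with grepl(). This is only a minimal sketch: the test vector x and the third pattern (which uses " +" to allow several spaces between the two words) are my own additions for illustration, not taken from your data.

```r
# Illustrative test values (assumed, not from the real file)
x <- c("United States", "  United States  ", "United State",
       "United   State   ", "Canada")

# Anchors only: the whole string must be exactly "United States"
exact <- grepl("^United States$", x)

# ".*" on both sides: any characters may surround "United State",
# but the words must still be separated by exactly one space
loose <- grepl("^.*United State.*$", x)

# " *" = zero or more spaces, " +" = one or more spaces,
# "s?" = an optional trailing "s"
spaces <- grepl("^ *United +States? *$", x)

print(data.frame(x, exact, loose, spaces))
```

Note that the fourth value, with several spaces between the words, is only caught by the last pattern.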
The substitution is then performed by
gsub(pat, "USA", s)
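Applied to a data frame column, the same idea might look like the sketch below. The small dopers data frame here is a hypothetical stand-in for your file, and the choices of ignore.case = TRUE and a final trimws() are assumptions about what the cleanup should do, not something stated in the question.

```r
# Hypothetical stand-in for the real data; only the Country column matters here
dopers <- data.frame(
  Country = c("  United States    ", "United   State   ",
              "UNITED STATES", "Canada "),
  stringsAsFactors = FALSE
)

# [[:space:]]* allows any surrounding whitespace,
# [[:space:]]+ allows one or more whitespace characters between the words,
# and "s?" makes the trailing "s" optional
pat <- "^[[:space:]]*United[[:space:]]+States?[[:space:]]*$"

# The pattern is anchored at both ends, so sub() and gsub() behave the same here
dopers$Country <- sub(pat, "USA", dopers$Country, ignore.case = TRUE)

# Strip leftover leading/trailing whitespace from the other values
dopers$Country <- trimws(dopers$Country)

print(dopers)
```

Because the pattern is anchored with ^ and $, values that merely contain "United States" somewhere inside a longer string are left unchanged.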