I am reading a CSV file into a data frame, dopers, in R.
dopers <- read.csv(file="generalDoping_alldata2.csv", header=TRUE, sep=",")
After reading the file, I have to do some data cleanup. For instance, if the country column says "United States" or "United State", I would like to replace it with "USA". I want my code to work even if the value is " United States " or "United State ". In other words, even if there are any characters before or after "United States", it should be replaced with "USA". I understand we can use the sub() function for that purpose. I was looking online and found the following, but I do not understand what "^", "$", "*" and "." do. Can someone please explain?
dopers$Country = sub("^UNITED STATES.*$", "USA", dopers$Country)
Given your examples,
s <- c(" United States", " United States ", "United States ")
You can define a regular expression pattern that matches them by
pat <- "^.*United State.*$"
Here, ^ represents the beginning and $ the end of the string, while . stands for any character and * defines a repetition (zero or more times). You can experiment with modified patterns, such as
pat <- "^[ ]*United States?[ ]*$" # only ignores surrounding spaces
pat <- "^.*(United State|USA).*$" # also matches " USA" etc.
The substitution is then performed by
gsub(pat, "USA", s)
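As a rough sketch of how this could be applied to a column like dopers$Country: the sample values below are made up for illustration, and ignore.case = TRUE is an addition that is not part of the patterns above. It is included because a pattern written as "UNITED STATES" would not match "United States" by default, since matching is case-sensitive.

```r
# Hypothetical sample values standing in for dopers$Country
country <- c(" United States", "United State ", "UNITED STATES", "USA", "Canada")

# ^             start of the string
# [[:space:]]*  zero or more whitespace characters
# s?            an optional trailing "s"
# $             end of the string
pat <- "^[[:space:]]*United States?[[:space:]]*$"

# ignore.case = TRUE lets the same pattern match "UNITED STATES" as well
cleaned <- sub(pat, "USA", country, ignore.case = TRUE)
print(cleaned)
```

Because the pattern is anchored with ^ and $, it can match at most once per string, so sub() and gsub() should give the same result here. Values that do not match, such as "Canada", are left unchanged.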