
How to use Sub Function in R

Tags:

regex

r

I am reading a csv file "dopers" in R.

dopers <- read.csv(file = "generalDoping_alldata2.csv", header = TRUE, sep = ",")

After reading the file, I have to do some data cleanup. For instance, if the country column says

"United States" or "United State"

I would like to replace it with "USA".

I want my code to work even if the value is " United States " or "United State ". In other words, even if there are extra characters before or after "United States", the value should be replaced with "USA". I understand that the sub() function can be used for this. I looked online and found the following, but I do not understand what "^", "$", "*" and "." do. Can someone please explain?

dopers$Country = sub("^UNITED STATES.*$", "USA", dopers$Country)
nasia jaffri asked Oct 12 '13 16:10



1 Answer

Given your examples,

s <- c(" United States", " United States ", "United States ")

You can define a regular expression pattern that matches them by

pat <- "^.*United State.*$"

Here, ^ marks the beginning of the string and $ marks its end, while . stands for any single character and * repeats the preceding item zero or more times. Together, .* therefore matches any run of characters, including none. You can experiment with modified patterns, such as

pat <- "^[ ]*United States?[ ]*$" # allows only spaces around the name; "s?" makes the final "s" optional
pat <- "^.*(United State|USA).*$" # also matches "  USA" etc.
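To see what each metacharacter contributes, here is a small sketch that uses grepl() to test which strings match. The vector s below is illustrative only; the "Canada" entry is added as a non-matching case and is not part of the original data.

```r
# Illustrative values; "Canada" is an assumed non-matching example
s <- c(" United States", " United States ", "United States ", "Canada")

# "^" anchors the match at the start of the string,
# so a leading space prevents a match
starts <- grepl("^United", s)
# FALSE FALSE TRUE FALSE

# "$" anchors the match at the end of the string,
# so a trailing space prevents a match
ends <- grepl("States$", s)
# TRUE FALSE FALSE FALSE

# ".*" (any character, zero or more times) on both sides
# absorbs whatever surrounds the name
padded <- grepl("^.*United State.*$", s)
# TRUE TRUE TRUE FALSE
```

Because the last pattern spans the whole string from ^ to $, a substitution replaces the entire value, not just the part that reads "United State".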

The substitution is then performed by

gsub(pat, "USA", s)
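Applied to a data frame column as in the question, this could look like the sketch below. The small data frame is a hypothetical stand-in for the real file, and ignore.case = TRUE is an assumption added here so that upper-case entries such as "UNITED STATES" are caught as well.

```r
# Hypothetical stand-in for the Country column read from the csv file
dopers <- data.frame(
  Country = c(" United States", "United State ", "UNITED STATES", "Canada"),
  stringsAsFactors = FALSE
)

pat <- "^.*United State.*$"

# The pattern covers the whole string, so sub() and gsub() give the same
# result here; ignore.case = TRUE also matches "UNITED STATES"
dopers$Country <- sub(pat, "USA", dopers$Country, ignore.case = TRUE)

dopers$Country
# "USA" "USA" "USA" "Canada"
```

Values that do not match the pattern, such as "Canada", are returned unchanged.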
Karsten W. answered Oct 01 '22 02:10
