
Proper use of gsub / regular expressions in R?

Tags: regex, list, r, gsub

I have long lists of strings such as this machine readable example:

A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))  

So it looks like this:

> A  
[[1]]  
 [1] "Biology"  
 [2] "Cell Biology"  
 [3] "Art"  
 [4] "Humanities, Multidisciplinary; Psychology, Experimental"  
 [5] "Astronomy & Astrophysics; Physics, Particles & Fields"  
 [6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"  
 [7] "Geriatrics & Gerontology"  
 [8] "Gerontology"  
 [9] "Management"  
[10] "Operations Research & Management Science"  
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"  
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"  

I would like to edit these terms and eliminate duplicates in order to get this result:

 [1] "Science"  
 [2] "Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science"  
 [6] "Social Sciences; Science"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Science"  
[11] "Science"  
[12] "Social Sciences; Science"  

So far I only got this:

stringedit <- function(A)  
{  
  A <-gsub("Biology", "Science", A)  
  A <-gsub("Cell Biology", "Science", A)  
  A <-gsub("Art", "Arts & Humanities", A)  
  A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)  
  A <-gsub("Psychology, Experimental", "Social Sciences", A)  
  A <-gsub("Astronomy & Astrophysics", "Science", A)  
  A <-gsub("Physics, Particles & Fields", "Science", A)  
  A <-gsub("Economics", "Social Sciences", A)  
  A <-gsub("Mathematics", "Science", A)  
  A <-gsub("Mathematics, Applied", "Science", A)  
  A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)  
  A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)  
  A <-gsub("Geriatrics & Gerontology", "Science", A)  
  A <-gsub("Gerontology", "Social Sciences", A)  
  A <-gsub("Management", "Social Sciences", A)  
  A <-gsub("Operations Research & Management Science", "Science", A)  
  A <-gsub("Computer Science, Artificial Intelligence", "Science", A)  
  A <-gsub("Computer Science, Information Systems", "Science", A)  
  A <-gsub("Engineering, Electrical & Electronic", "Science", A)  
  A <-gsub("Statistics & Probability", "Science", A)  
}  
B <- lapply(A, stringedit)  

But it does not work properly:

> B  
[[1]]  
 [1] "Science"  
 [2] "Cell Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science; Science"  
 [6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Operations Research & Social Sciences Science"  
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"  
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"  

How can I achieve the correct output mentioned above?
Thank you very much in advance for your consideration!

user1496104 asked Oct 22 '12 10:10



1 Answer

I found it easiest to have a two-column data.frame as a lookup, with one column for the course name and one column for the category. Here's an example:

course.categories <- data.frame(
  Course = 
  c("Art", "Humanities, Multidisciplinary", "Biology", "Cell Biology", 
    "Astronomy & Astrophysics", "Physics, Particles & Fields", "Mathematics", 
    "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
    "Geriatrics & Gerontology", "Operations Research & Management Science", 
    "Computer Science, Artificial Intelligence", 
    "Computer Science, Information Systems", 
    "Engineering, Electrical & Electronic", "Statistics & Probability", 
    "Psychology, Experimental", "Economics", 
    "Social Sciences, Mathematical Methods", 
    "Gerontology", "Management"),
  Category =
  c("Arts & Humanities", "Arts & Humanities", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Social Sciences", 
    "Social Sciences", "Social Sciences", "Social Sciences", "Social Sciences"))

Then, assuming A is a list as in your question:

sapply(strsplit(unlist(A), "; "), 
       function(x) 
         paste(unique(course.categories[match(x, course.categories[["Course"]]),
                                        "Category"]), 
               collapse = "; "))
#  [1] "Science"                            "Science"                           
#  [3] "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
#  [5] "Science"                            "Social Sciences; Science"          
#  [7] "Science"                            "Social Sciences"                   
#  [9] "Social Sciences"                    "Science"                           
# [11] "Science"                            "Social Sciences; Science"

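strsplit first splits each string on "; " into individual course names. match then looks up each name in course.categories[["Course"]] and returns the row on which it occurs; those row numbers are used to extract the category each course belongs to. Because match compares whole strings rather than substrings, "Art" no longer matches inside "Artificial", and the order of the lookup rows does not matter. Then, unique makes sure we just have one of each category, and paste with collapse = "; " puts things back together.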
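To make the individual steps easier to follow, here is a minimal sketch that walks one string through the same pipeline. The small lookup table and the helper name categorize are made up for illustration (they are not part of the code above), and the warning for unmatched course names is an assumption about what you might want, since match returns NA for a name that is missing from the lookup and paste would otherwise turn that into the text "NA".

```r
# Hypothetical miniature lookup (a subset of course.categories above).
lookup <- data.frame(
  Course   = c("Economics",
               "Mathematics, Interdisciplinary Applications",
               "Social Sciences, Mathematical Methods"),
  Category = c("Social Sciences", "Science", "Social Sciences"),
  stringsAsFactors = FALSE
)

x <- "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"

parts  <- strsplit(x, "; ", fixed = TRUE)[[1]]  # three separate course names
rows   <- match(parts, lookup$Course)           # row index per course, NA if unknown
cats   <- unique(lookup$Category[rows])         # drop repeats, keep first-seen order
result <- paste(cats, collapse = "; ")
result
# "Social Sciences; Science"

# Hypothetical helper: same steps, but warns about course names that are
# missing from the lookup instead of silently producing "NA".
categorize <- function(x, lookup) {
  parts <- strsplit(x, "; ", fixed = TRUE)[[1]]
  rows  <- match(parts, lookup$Course)
  if (anyNA(rows)) {
    warning("no category for: ", paste(parts[is.na(rows)], collapse = ", "))
  }
  paste(unique(lookup$Category[rows[!is.na(rows)]]), collapse = "; ")
}
```

With the full lookup table, something like vapply(unlist(A), categorize, character(1), lookup = course.categories, USE.NAMES = FALSE) should give the same vector as the sapply call above, plus a warning for any course that still needs a row in the lookup.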

A5C1D2H2I1M1N2O1R2T1 answered Sep 28 '22 18:09