Proper use of gsub / regular expressions in R?

I have long lists of strings, such as this machine-readable example:

A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))

So it looks like this:

> A  
[[1]]  
 [1] "Biology"  
 [2] "Cell Biology"  
 [3] "Art"  
 [4] "Humanities, Multidisciplinary; Psychology, Experimental"  
 [5] "Astronomy & Astrophysics; Physics, Particles & Fields"  
 [6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"  
 [7] "Geriatrics & Gerontology"  
 [8] "Gerontology"  
 [9] "Management"  
[10] "Operations Research & Management Science"  
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"  
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"

I would like to edit these terms and eliminate duplicates in order to get this result:

 [1] "Science"  
 [2] "Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science"  
 [6] "Social Sciences; Science"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Science"  
[11] "Science"  
[12] "Social Sciences; Science"

So far I have only got this:

stringedit <- function(A)  
{  
  A <-gsub("Biology", "Science", A)  
  A <-gsub("Cell Biology", "Science", A)  
  A <-gsub("Art", "Arts & Humanities", A)  
  A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)  
  A <-gsub("Psychology, Experimental", "Social Sciences", A)  
  A <-gsub("Astronomy & Astrophysics", "Science", A)  
  A <-gsub("Physics, Particles & Fields", "Science", A)  
  A <-gsub("Economics", "Social Sciences", A)  
  A <-gsub("Mathematics", "Science", A)  
  A <-gsub("Mathematics, Applied", "Science", A)  
  A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)  
  A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)  
  A <-gsub("Geriatrics & Gerontology", "Science", A)  
  A <-gsub("Gerontology", "Social Sciences", A)  
  A <-gsub("Management", "Social Sciences", A)  
  A <-gsub("Operations Research & Management Science", "Science", A)  
  A <-gsub("Computer Science, Artificial Intelligence", "Science", A)  
  A <-gsub("Computer Science, Information Systems", "Science", A)  
  A <-gsub("Engineering, Electrical & Electronic", "Science", A)  
  A <-gsub("Statistics & Probability", "Science", A)  
}  
B <- lapply(A, stringedit)

But it does not work properly:

> B  
[[1]]  
 [1] "Science"  
 [2] "Cell Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science; Science"  
 [6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Operations Research & Social Sciences Science"  
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"  
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"

How can I achieve the correct output mentioned above?
Thank you very much in advance for your consideration!

asked Oct 22 '12 10:10

user1496104

1 Answer

I found it easiest to have a two-column data.frame as a lookup, with one column for the course name and one column for the category. Here's an example:

course.categories <- data.frame(
  Course = 
  c("Art", "Humanities, Multidisciplinary", "Biology", "Cell Biology", 
    "Astronomy & Astrophysics", "Physics, Particles & Fields", "Mathematics", 
    "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
    "Geriatrics & Gerontology", "Operations Research & Management Science", 
    "Computer Science, Artificial Intelligence", 
    "Computer Science, Information Systems", 
    "Engineering, Electrical & Electronic", "Statistics & Probability", 
    "Psychology, Experimental", "Economics", 
    "Social Sciences, Mathematical Methods", 
    "Gerontology", "Management"),
  Category =
  c("Arts & Humanities", "Arts & Humanities", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Social Sciences", 
    "Social Sciences", "Social Sciences", "Social Sciences", "Social Sciences"))

Then, assuming A is a list as in your question:

sapply(strsplit(unlist(A), "; "), 
       function(x) 
         paste(unique(course.categories[match(x, course.categories[["Course"]]),
                                        "Category"]), 
               collapse = "; "))
#  [1] "Science"                            "Science"                           
#  [3] "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
#  [5] "Science"                            "Social Sciences; Science"          
#  [7] "Science"                            "Social Sciences"                   
#  [9] "Social Sciences"                    "Science"                           
# [11] "Science"                            "Social Sciences; Science"

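strsplit first splits each string on "; " into its individual course names. match then looks up each name in the Course column of course.categories and returns the row on which the exact match occurs; those row numbers are used to extract the category each course belongs to. Because match compares whole strings, "Art" can no longer hit the "Art" inside "Artificial", which is what went wrong with gsub. Then, unique makes sure we just have one of each category, and paste with collapse = "; " puts things back together.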
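To make the individual steps easier to follow, here is the same pipeline unrolled for a single string, followed by the substring behaviour of gsub for comparison. This is a sketch, not part of the original answer: `lookup` is a shortened, hypothetical version of course.categories, and the input string `x` is made up for illustration.

```r
# Shortened, hypothetical version of the lookup table (illustration only)
lookup <- data.frame(
  Course   = c("Art", "Computer Science, Artificial Intelligence",
               "Economics", "Statistics & Probability"),
  Category = c("Arts & Humanities", "Science", "Social Sciences", "Science"),
  stringsAsFactors = FALSE
)

x <- "Economics; Statistics & Probability; Computer Science, Artificial Intelligence"

# Step 1: split the string into whole course names
parts <- strsplit(x, "; ", fixed = TRUE)[[1]]

# Step 2: exact lookup; match() returns the row index of each name
idx <- match(parts, lookup$Course)
idx
# [1] 3 4 2

# Step 3: pull the categories, drop duplicates, and join them again
cats <- lookup$Category[idx]
result <- paste(unique(cats), collapse = "; ")
result
# [1] "Social Sciences; Science"

# For comparison: gsub() replaces substrings, so "Art" also hits "Artificial"
broken <- gsub("Art", "Arts & Humanities",
               "Computer Science, Artificial Intelligence")
broken
# [1] "Computer Science, Arts & Humanitiesificial Intelligence"
```

The order of categories in the result follows the order in which they first appear in the input string, since unique keeps the first occurrence of each value.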

answered Sep 28 '22 18:09

A5C1D2H2I1M1N2O1R2T1

Tags: regex, list, r, gsub
