
How to grep in dplyr with mutate

Tags:

r

dplyr

I'd like some help understanding what is going on in my dplyr pipe and am requesting various solutions to this problem.

Problem

I have a list of institutes (the affiliations of the authors of a research journal article) and I'd like to extract the main institute name. If it's a university, it will look like Univ. of XX, and that is the example I'm sticking to here for simplicity.

Attempted Solution Logic

  1. Split the institute name by comma
  2. grep for the term "univ" or other list of university-related terms
  3. extract the index where there is a hit
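
For a single institute string, the three steps above can be sketched in base R roughly as follows. This is only an illustration: the variable names are made up for the example, and the "college" entry in the term list is a hypothetical stand-in for "other university-related terms".

```r
# One institute string taken from the example data
one_institute <- "school of computer science and engineering, university of new south wales, sydney, australia"

# Hypothetical list of university-related terms, combined into one regex
university_terms <- c("univ", "college")
pattern <- paste(university_terms, collapse = "|")

# Step 1: split the institute name by comma
parts <- strsplit(one_institute, ",")[[1]]

# Steps 2 and 3: grep returns the index (or indices) of the pieces that match
hit_index <- grep(pattern, parts)

# Keep the first hit; trimws() drops the space left over after the comma
inst_guess <- trimws(parts[hit_index][1])
inst_guess
```

Note that this works on one string at a time; applying it to a whole column needs some form of per-row iteration.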

Edge cases / assumptions

  • The term I am searching for exists in only one of the splits
  • all institutes here are universities (keeping the problem simple here for Stack Overflow)

Code

df %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
  head()

I expected the code to follow the logic I wrote above, but it doesn't. What I see instead is that, within mutate, the first instance of institute is being searched for EVERY row in df, so the exact same "university of new so~" fills in everywhere. I have a general idea of what the mistake is, but no idea why it's happening or how to fix it while keeping dplyr. I can do this with an apply function, and I'm curious what answers SO has.

What it Looks Like:

# A tibble: 6 x 2
  institute                                                                          instGuess              
  <chr>                                                                              <chr>                  
1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
3 department of ece, pesit, bangalore, india                                         " university of new so~
4 school of information technology and electrical engineering, university of queens~ " university of new so~
5 school of information technology and electrical engineering, university of queens~ " university of new so~
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~

Data Used for Example

df <- structure(list(institute = c("school of computer science and engineering, university of new south wales, sydney, australia", 
"department computer science, friedrich-alexander-university, erlangen-nuremberg, germany", 
"department of ece, pesit, bangalore, india", "school of information technology and electrical engineering, university of queenslandqld, australia", 
"school of information technology and electrical engineering, university of queenslandold, australia", 
"dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
), instGuess = c(" university of new south wales", " university of new south wales", 
" university of new south wales", " university of new south wales", 
" university of new south wales", " university of new south wales"
)), .Names = c("institute", "instGuess"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Kamil asked Dec 24 '22 07:12



2 Answers

You need to include a group_by for your syntax to work:

df %>%
  group_by(institute) %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1])

Produces:

# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                  instGuess              
  <chr>                                                                      <chr>                  
1 school of computer science and engineering, university of new south wales… " university of new so…
2 department computer science, friedrich-alexander-university, erlangen-nur… " friedrich-alexander-…
3 department of ece, pesit, bangalore, india                                 NA                     
4 school of information technology and electrical engineering, university o… " university of queens…
5 school of information technology and electrical engineering, university o… " university of queens…
6 dept. of info. syst. and comp. sci., national university of singapore, 10… " national university …
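
As a rough illustration of why the grouping matters: without group_by, mutate evaluates the expression once with institute bound to the whole column, so unlist flattens the pieces of every row into one vector and [1] picks the first hit overall, which is then recycled to all rows. The base R sketch below mimics that on three of the example strings (the variable names are chosen just for this example):

```r
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india",
  "dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
)

# What the ungrouped expression sees: all rows at once, flattened together
pieces <- unlist(strsplit(institute, ","))
length(pieces)   # pieces from all three rows, no longer tied to a row

# The first hit across ALL rows; a length-1 result is recycled to every row
first_hit <- pieces[grep("univ", pieces)][1]
first_hit

# Working on one row's pieces at a time gives one guess per row instead;
# a row with no match yields NA
per_row <- vapply(
  strsplit(institute, ","),
  function(p) p[grep("univ", p)][1],
  character(1)
)
per_row
```

Grouping by institute has the same effect as the per-row version here, because each group then contains only one distinct string.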
Pdubbs answered Dec 25 '22 22:12



I think @Pdubbs' answer is the best one: it uses group_by to do what @www's answer does with rowwise(), but with the difference (and in my mind a clear advantage) that when there are repeats in $institute, the guess is computed only once per distinct institute.

This goes one step further and does not re-strsplit on each instance. I'll duplicate the first row:

df <- df[c(1,1:6),]

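Define a function that does the work without duplicating strsplit:

find_univ <- function(x) {
  # print one '*' per call so the number of calls is visible
  message('*', appendLF=FALSE)
  # within a group every element of x is the same string, so split only the first
  y <- strsplit(x[[1]], ',')[[1]]
  # first piece containing 'univ', or NA if there is none
  y[grep('univ', y)][1]
}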

(I've inserted a message call to show how many times the function is called; do not include it in production.) Then the sequence:

df %>%
  group_by(institute) %>%
  mutate(instGuess = find_univ(institute)) %>%
  ungroup() %>%
  select(instGuess) # for display purposes only
# ******  <---- six calls on seven rows, benefit of group_by
# A tibble: 7 × 1
#                           instGuess
#                               <chr>
# 1     university of new south wales
# 2     university of new south wales
# 3    friedrich-alexander-university
# 4                              <NA>
# 5       university of queenslandqld
# 6       university of queenslandold
# 7  national university of singapore

I don't know how much this de-duplication of strsplit matters in practice; it is likely to pay off only if you have a large amount of data. Otherwise, it's just a fastidious level of efficiency, though not quite "premature optimization".
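
The same once-per-distinct-institute idea can also be sketched without dplyr, by computing the guess on the unique values and mapping the results back with match. This is a minimal base R sketch rather than part of the approach above; find_univ_one is a hypothetical helper named for this example.

```r
# Hypothetical helper: guess the university piece of ONE institute string
find_univ_one <- function(s) {
  parts <- strsplit(s, ",")[[1]]
  parts[grep("univ", parts)][1]   # NA if nothing matches
}

a <- "school of computer science and engineering, university of new south wales, sydney, australia"
b <- "department of ece, pesit, bangalore, india"
d <- "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany"
institute <- c(a, a, b, d)        # first value duplicated on purpose

# Do the work once per distinct institute ...
unique_inst <- unique(institute)
guess_by_inst <- vapply(unique_inst, find_univ_one, character(1), USE.NAMES = FALSE)

# ... then map each row back to the result for its institute
inst_guess <- guess_by_inst[match(institute, unique_inst)]
inst_guess
```

Here strsplit runs three times for four rows; whether that saving is noticeable depends on how many repeats the real data contains.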

r2evans answered Dec 25 '22 22:12
