I'd like some help understanding what is going on in my dplyr pipe, and I'm interested in alternative solutions to this problem.
I have a list of institutes (the affiliations of the authors of research journal articles) and I'd like to extract the main institute name. If it's a university, it will be Univ. of XX, and that is the example I'm sticking to here for simplicity.
df %>%
mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
head()
What I expected is the logic I wrote above, applied to each row. What I see instead is that, within mutate, the first instance of institute is being searched for EVERY row in df, and the exact same "university of new so~" is filled in. I have a general idea of what the mistake is, but no idea why it's happening or how to fix it while keeping dplyr. I can do this with an apply function, but I'm curious what other answers SO has.
What it Looks Like:
# A tibble: 6 x 2
institute instGuess
<chr> <chr>
1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
3 department of ece, pesit, bangalore, india " university of new so~
4 school of information technology and electrical engineering, university of queens~ " university of new so~
5 school of information technology and electrical engineering, university of queens~ " university of new so~
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~
df <- structure(list(institute = c("school of computer science and engineering, university of new south wales, sydney, australia",
"department computer science, friedrich-alexander-university, erlangen-nuremberg, germany",
"department of ece, pesit, bangalore, india", "school of information technology and electrical engineering, university of queenslandqld, australia",
"school of information technology and electrical engineering, university of queenslandold, australia",
"dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
), instGuess = c(" university of new south wales", " university of new south wales",
" university of new south wales", " university of new south wales",
" university of new south wales", " university of new south wales"
)), .Names = c("institute", "instGuess"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
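Inside mutate, institute is the whole column, so unlist(strsplit(institute, ",")) flattens the pieces of every row into one vector, and [1] always picks the first match in that vector. You need to include a group_by for your syntax to work: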
df %>%
group_by(institute) %>%
mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1])
Produces:
# A tibble: 6 x 2
# Groups: institute [6]
institute instGuess
<chr> <chr>
1 school of computer science and engineering, university of new south wales… " university of new so…
2 department computer science, friedrich-alexander-university, erlangen-nur… " friedrich-alexander-…
3 department of ece, pesit, bangalore, india NA
4 school of information technology and electrical engineering, university o… " university of queens…
5 school of information technology and electrical engineering, university o… " university of queens…
6 dept. of info. syst. and comp. sci., national university of singapore, 10… " national university …
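To see why the ungrouped version repeats one value, it may help to run the inner expression on a plain character vector outside dplyr. The following is only a minimal base R sketch: the vector inst is a shortened, made-up stand-in for df$institute, and first_univ is a hypothetical helper name, not something from dplyr.

```r
# Shortened, made-up stand-in for df$institute (illustration only)
inst <- c("school of cs, university of new south wales, sydney",
          "department of ece, pesit, bangalore, india",
          "dept. of info. syst., national university of singapore")

# What the ungrouped mutate() evaluates: strsplit() returns one list element
# per row, and unlist() flattens them all into a single vector of pieces
pieces <- unlist(strsplit(inst, ","))
whole_column <- pieces[grep("univ", pieces)][1]  # a single value, recycled to every row

# Hypothetical helper: keep the list structure and take the first match per row
first_univ <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE),
         function(p) p[grep("univ", p)][1],  # NA when no piece contains "univ"
         character(1))
}
per_row <- first_univ(inst)
```

Because first_univ is vectorized over its input, it should also work as mutate(instGuess = first_univ(institute)) without any group_by. Wrapping the result in trimws() would drop the leading space.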
I think @Pdubbs' answer is the best one: he uses group_by to mimic @www's answer that uses rowwise(), but the difference (and in my mind the clear advantage) is that when there are repeats of $institute, efficiency is gained by only doing this guess once per institute.
This goes one step further and does not re-run strsplit on each instance. I'll duplicate the first row:
df <- df[c(1,1:6),]
Define a function that does the work without duplicating strsplit:
find_univ <- function(x) {
message('*', appendLF=FALSE)
y <- strsplit(x[[1]], ',')[[1]]
y[grep('univ', y)][1]
}
(I inserted a message call to indicate how many times the function is called; do not include it in production.) Then the sequence:
df %>%
group_by(institute) %>%
mutate(instGuess = find_univ(institute)) %>%
ungroup() %>%
select(instGuess) # for display purposes only
# ****** <---- six calls on seven rows, benefit of group_by
# A tibble: 7 × 1
# instGuess
# <chr>
# 1 university of new south wales
# 2 university of new south wales
# 3 friedrich-alexander-university
# 4 <NA>
# 5 university of queenslandqld
# 6 university of queenslandold
# 7 national university of singapore
I don't know how impactful this de-duplication of strsplit is; it is only likely to be beneficial if you have a large amount of data. Otherwise, it's just OCD-level efficiency, hopefully without being "premature optimization".
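The same once-per-distinct-value idea can also be sketched without grouping, by working on unique() values and mapping the results back with match(). This is only an illustration under stated assumptions: guess_unique is a hypothetical helper name and the input vector is made up.

```r
# Hypothetical helper: run the split-and-search once per distinct string,
# then map the results back to the original positions with match()
guess_unique <- function(x) {
  ux <- unique(x)
  guess <- vapply(strsplit(ux, ",", fixed = TRUE),
                  function(p) p[grep("univ", p)][1],  # NA when nothing matches
                  character(1))
  guess[match(x, ux)]
}

# Made-up input with one repeated entry (illustration only)
inst <- c("a, university of x", "a, university of x", "b, pesit", "c, univ of y")
res <- guess_unique(inst)
```

Whether this is measurably faster than the group_by version would depend on the data, so it is worth timing on a real column before adopting it.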