I'm pretty new to R, so please be patient with me.
I have a character column (vector) that describes illnesses and diagnosis keywords in an inconsistent format. Sample values are:
flu
fever/feverish
fever cold
I'm looking for the best way to extract all unique words from this. The best process I could figure out so far gives me a list of lists:
[[1]]
[[1]][[1]]
[1] "flu"
[[2]]
[[2]][[1]]
[1] "fever" "feverish"
...
I achieve this by using:
split_words <- function(x){ strsplit(x, "[^[:alpha:]]+") }
lapply(diagnoses, split_words)
What is the best approach to convert this into a single vector or single-column data frame, so that I can run unique on it and remove duplicates?
What are the best packages in R for word stemming, to merge similar spellings, plurals, etc.?
You could use unlist after strsplit to get a vector with all elements, and unique for the unique elements.
x <- c("flu", "fever/feverish", "fever cold")
( ul <- unlist(strsplit(x, "\\s+|[[:punct:]]")) )
# [1] "flu" "fever" "feverish" "fever" "cold"
unique(ul)
# [1] "flu" "fever" "feverish" "cold"
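The same idea can be applied with the split pattern from the question. This is a minimal sketch, assuming the sample values shown in the question; the names pieces, words, unique_words and words_df are illustrative and not from the original post. Since strsplit already works on a whole character vector, the lapply wrapper is not needed, and dropping it avoids the extra level of list nesting.

```r
# Sample values taken from the question.
diagnoses <- c("flu", "fever/feverish", "fever cold")

# strsplit() is vectorised over its input: it returns a list with one
# character vector per element of `diagnoses`, so no lapply() is required.
pieces <- strsplit(diagnoses, "[^[:alpha:]]+")

# Flatten the list, normalise case, and drop empty strings, which can
# appear when an entry starts with a separator.
words <- tolower(unlist(pieces))
words <- words[nzchar(words)]

unique_words <- unique(words)
unique_words
# [1] "flu"      "fever"    "feverish" "cold"

# Single-column data frame, if that shape is more convenient downstream.
words_df <- data.frame(word = unique_words, stringsAsFactors = FALSE)
```

Lower-casing and removing empty strings are optional extras; they only matter if the real data mixes case or has leading separators.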
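# For a data frame column (df is built here from x for illustration):
df <- data.frame(illness = x, stringsAsFactors = FALSE)
# > df
#          illness
# 1            flu
# 2 fever/feverish
# 3     fever cold
udf <- unlist(strsplit(df$illness, "[^[:alnum:]]"))
udf
# [1] "flu" "fever" "feverish" "fever" "cold"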
table(udf)
# udf
#     cold    fever feverish      flu
#        1        2        1        1
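On the stemming part of the question, one option is the SnowballC package, whose wordStem function applies a Snowball stemmer to a character vector. The sketch below is hedged: it assumes SnowballC is installed, and the two plural entries in the input are hypothetical additions for illustration, not values from the question. A stemmer strips common suffixes such as plural endings; whether a pair like "fever" and "feverish" ends up merged depends on the stemming algorithm, so the result is worth inspecting before relying on it.

```r
# Words from the question plus two hypothetical plurals for illustration.
words <- c("flu", "fever", "feverish", "cold", "fevers", "colds")

# Guard the call so the script still runs when SnowballC is missing.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  # wordStem() returns one stem per input word.
  stems <- SnowballC::wordStem(words, language = "english")
  print(unique(stems))
} else {
  message("SnowballC is not installed; run install.packages(\"SnowballC\") to try this.")
}
```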