List of unique words from data.frame

Question

I'm pretty new to R, so please be patient with me.

I have a data frame with a character column that describes illnesses and diagnosis keywords in an inconsistent format. Samples are:

flu
fever/feverish
fever cold

I'm looking for the best way to extract all unique words from this. The best process I could figure out so far gives me a list of lists:

[[1]]
[[1]][[1]]
[1] "flu"

[[2]]
[[2]][[1]]
[1] "fever" "feverish"
...

I achieve this by using:

split_words <- function(x){ strsplit(x, "[^[:alpha:]]+") }
lapply(diagnoses, split_words)

What is the best approach to convert this into a single vector or a single-column data frame, so that I can run unique on it and remove duplicates?

What are the best packages in R for word stemming, to merge similar spellings, plurals, etc.?

Rich Scriven · Accepted Answer

You could use unlist after strsplit to get a single vector with all the words, and then unique to drop the duplicates.

x <- c("flu", "fever/feverish", "fever cold")
( ul <- unlist(strsplit(x, "\\s+|[[:punct:]]")) )
# [1] "flu"      "fever"    "feverish" "fever"    "cold"  
unique(ul)
# [1] "flu"      "fever"    "feverish" "cold"
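Real input is often messier than the three samples, so here is a minimal sketch of the same unlist/unique idea with some cleanup added. It reuses the "[^[:alpha:]]+" pattern from the question. The mixed-case sample vector and the variable names are assumptions made for illustration. Because unlist is recursive by default, unlist(lapply(diagnoses, split_words)) would flatten the nested list from the question in the same way.

```r
# Hypothetical sample with mixed case and a comma followed by a space
diagnoses <- c("flu", "Fever/feverish", "fever, cold")

# Lower-case first, split on runs of non-letters, then flatten to one vector
words <- unlist(strsplit(tolower(diagnoses), "[^[:alpha:]]+"))

# Drop empty strings, which appear when a string starts with a separator
words <- words[nzchar(words)]

unique_words <- unique(words)
unique_words
# [1] "flu"      "fever"    "feverish" "cold"
```

Splitting on a run of separators (the trailing +) avoids the empty strings that a single-character pattern produces for input such as "fever, cold".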

Paulo E. Cardoso · Answer

# > df
#         illness
#1            flu
#2 fever/feverish
#3     fever cold

udf <- unlist(strsplit(df$illness, "[^[:alnum:]]"))
# [1] "flu"      "fever"    "feverish" "fever"    "cold"

table(udf)
#udf
#    cold    fever feverish      flu 
#       1        2        1        1
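If the column was read in as a factor (the default for data.frame before R 4.0), strsplit will reject it, so converting with as.character first is a safe habit. The sketch below also builds the single-column data frame the question asks for. The data frame is rebuilt here only so the example runs on its own, and the names unique_df and word are assumptions, not part of the original answer.

```r
# Hypothetical data frame; stringsAsFactors = TRUE mimics the pre-4.0 default
df <- data.frame(illness = c("flu", "fever/feverish", "fever cold"),
                 stringsAsFactors = TRUE)

# strsplit() needs character input, so convert the factor first
words <- unlist(strsplit(as.character(df$illness), "[^[:alnum:]]+"))

# One row per distinct word, kept as character rather than factor
unique_df <- data.frame(word = unique(words), stringsAsFactors = FALSE)
unique_df
```

table(words) still gives the per-word counts shown above, while unique_df holds each word once, in order of first appearance.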

Tags:

r

Hans