My data is already in a data frame, with one token per line. I'd like to filter out the rows that contain stop words.
The dataframe looks like:
docID <- c(1,2,2)
token <- c('the', 'cat', 'sat')
count <- c(10,20,30)
df <- data.frame(docID, token, count)
I've tried the below, but get an error:
library(tidyverse)
library(tidytext)
library(topicmodels)
library(stringr)
data('stop_words')
clean_df <- df %>%
anti_join(stop_words, by=df$token)
Error:
Error: `by` can't contain join column `the`, `cat`, `sat` which is missing from LHS
How can I resolve this?
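When you call anti_join(), the by argument needs the names of the columns to join on, not the values stored in them. Passing by = df$token hands it the tokens themselves ("the", "cat", "sat"), which is why the error complains that those join columns are missing from the left-hand side. In the stop_words data object in tidytext, the column is called word, and in your data frame it is called token, so map one to the other with a named vector.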
library(tidyverse)
library(tidytext)
docID <- c(1, 2, 2, 2, 3)
token <- c("the", "cat", "sat", "on-the-mat", "with3hats")
count <- c(10, 20, 30, 10, 20)
# tibble() replaces the deprecated data_frame()
df <- tibble(docID, token, count)
clean_df <- df %>%
  anti_join(stop_words, by = c("token" = "word"))
clean_df
#> # A tibble: 4 x 3
#> docID token count
#> <dbl> <chr> <dbl>
#> 1 2.00 cat 20.0
#> 2 2.00 sat 30.0
#> 3 2.00 on-the-mat 10.0
#> 4 3.00 with3hats 20.0
Notice that "the" is now gone because it is in the stop_words dataset.
In a comment, you asked about removing tokens that contain punctuation or numbers. I'd use filter() with str_detect() for this. (You can use filter() to remove stop words too, if you prefer.)
clean_df <- df %>%
  filter(!str_detect(token, "[:punct:]|[:digit:]"))
clean_df
#> # A tibble: 3 x 3
#> docID token count
#> <dbl> <chr> <dbl>
#> 1 1.00 the 10.0
#> 2 2.00 cat 20.0
#> 3 2.00 sat 30.0
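The filter() route for stop words mentioned above could look something like the sketch below. It assumes the same example data as before; the name no_stop_df is just an illustrative choice, not something from the question.

```r
library(dplyr)
library(tidytext)

# Same example data as in the answer above
df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

# Keep only the rows whose token does not appear in the stop word list.
# no_stop_df is a hypothetical name used for illustration.
no_stop_df <- df %>%
  filter(!token %in% stop_words$word)

no_stop_df
```

On this data it should drop the same row as the anti_join() version; which one to use is mostly a matter of taste.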
If you want to do both, chain the anti_join() and filter() steps together in a single pipeline.
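As a minimal sketch of the combined version, assuming the same example data and the same regular expression as above:

```r
library(dplyr)
library(stringr)
library(tidytext)

# Same example data as in the answer above
df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

clean_df <- df %>%
  # Step 1: drop rows whose token is a stop word
  anti_join(stop_words, by = c("token" = "word")) %>%
  # Step 2: drop rows whose token contains punctuation or a digit
  filter(!str_detect(token, "[:punct:]|[:digit:]"))

clean_df
```

With this sample data only the "cat" and "sat" rows should remain, since "the" is a stop word, "on-the-mat" contains hyphens, and "with3hats" contains a digit.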