
Identify meaningless or gibberish text from a data frame in R. Is there a way to partially match string/words to a dictionary?

Tags: string, r, tm

I am looking to create a variable (column) in my data frame that identifies suspected meaningless text (e.g. "asdkjhfas"), or the inverse. This is part of a general script that will assist my team with cleaning survey data.

A function I found on Stack Overflow (link and credit below) lets me match single words against a dictionary, but it does not handle entries that contain multiple words.

Is there any way I can do a partial match (rather than strict) with a dictionary?

library(qdapDictionaries) # install.packages("qdapDictionaries")

is.word  <- function(x) x %in% GradyAugmented

x <- c(1, 2, 3, 4, 5, 6)
y <- c("this is text", "word", "random", "Coca-cola",
       "this is meaningful asdfasdf", "sadfsdf")
df <- data.frame(x,y)


df$z[is.word(df$y)] <- TRUE
df

In a perfect world I would get a column: df$z TRUE TRUE TRUE TRUE TRUE NA

My actual results are: df$z NA TRUE TRUE NA NA NA

I would be more than happy with: df$z TRUE TRUE TRUE NA TRUE NA

I found the function is.word in the question "Remove meaningless words from corpus in R", thanks to user parth.

moose-png asked Sep 16 '25 15:09


2 Answers

This works with dplyr and tidytext. It is a bit longer than I expected; there might be a shortcut somewhere.

I split each entry into words, check each word against the dictionary, and count the number of TRUE values per entry. If the count is greater than 0, the entry contains at least one dictionary word; otherwise it does not.

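library(qdapDictionaries) # provides GradyAugmented
library(tidytext)
library(dplyr)

df %>% 
  unnest_tokens(words, y) %>%                    # one row per word
  mutate(text = words %in% GradyAugmented) %>%   # is the word in the dictionary?
  group_by(x) %>% 
  summarise(z = sum(text)) %>%                   # dictionary words per entry
  inner_join(df) %>%                             # bring back the original text
  mutate(z = z > 0)                              # TRUE if at least one word matched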


Joining, by = "x"
# A tibble: 6 x 3
      x z     y                          
  <dbl> <lgl> <chr>                      
1     1 TRUE  this is text               
2     2 TRUE  word                       
3     3 TRUE  random                     
4     4 TRUE  Coca-cola                  
5     5 TRUE  this is meaningful asdfasdf
6     6 FALSE sadfsdf     
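
The counting step above also lends itself to a score rather than a yes/no flag. Below is a minimal base R sketch of that idea, assuming a small stand-in dictionary so it runs without extra packages. The names `toy_dictionary` and `word_share` are hypothetical and not from any package; in practice `GradyAugmented` could be passed as `dictionary` instead.

```r
# Hypothetical stand-in for GradyAugmented, kept tiny so the sketch is
# runnable without installing qdapDictionaries.
toy_dictionary <- c("this", "is", "text", "word", "random",
                    "coca", "cola", "meaningful")

# Hypothetical helper: share of the words in each entry found in `dictionary`.
word_share <- function(text, dictionary) {
  # Lower-case, then split on any run of characters that are not letters.
  tokens <- strsplit(tolower(text), "[^a-z]+")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]                  # drop empty strings left by the split
    if (length(tok) == 0) return(NA_real_)   # nothing to score
    mean(tok %in% dictionary)                # fraction of words that matched
  }, numeric(1))
}

y <- c("this is text", "word", "random", "Coca-cola",
       "this is meaningful asdfasdf", "sadfsdf")
share <- word_share(y, toy_dictionary)
share
# 1.00 1.00 1.00 1.00 0.75 0.00
```

A cut-off such as `share >= 0.5` would then flag mostly meaningful entries; the threshold is a judgment call that depends on the survey data.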

phiver answered Sep 18 '25 04:09


Here's a solution using purrr (along with dplyr and stringr):

library(tidyverse)
library(qdapDictionaries) # provides GradyAugmented

your_data <- tibble(text = c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", "sadfsdf"))

your_data %>%
 # split the text on spaces and punctuation
 mutate(text_split = str_split(text, "\\s|[:punct:]")) %>% 
 # see if some element of the provided text is an element of your dictionary
 mutate(meaningful = map_lgl(text_split, some, is.element, GradyAugmented)) 

# A tibble: 6 x 3
  text                        text_split meaningful
  <chr>                       <list>     <lgl>     
1 this is text                <chr [3]>  TRUE      
2 word                        <chr [1]>  TRUE      
3 random                      <chr [1]>  TRUE      
4 Coca-cola                   <chr [2]>  TRUE      
5 this is meaningful asdfasdf <chr [4]>  TRUE      
6 sadfsdf                     <chr [1]>  FALSE     
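
For scripts that cannot depend on the tidyverse, the same "some word is in the dictionary" test can be sketched in base R. This is only an illustration under stated assumptions: `toy_dictionary` is a hypothetical stand-in for `GradyAugmented`, `has_dictionary_word` is a hypothetical helper name, and the input is lower-cased before matching. It returns TRUE/NA to mirror the `df$z` column described in the question.

```r
# Hypothetical stand-in for GradyAugmented, kept tiny so the sketch is
# runnable without installing qdapDictionaries.
toy_dictionary <- c("this", "is", "text", "word", "random",
                    "coca", "cola", "meaningful")

# Hypothetical helper: TRUE if any word of the entry is in `dictionary`,
# NA otherwise (matching the TRUE/NA column asked for in the question).
has_dictionary_word <- function(text, dictionary) {
  # Lower-case, then split on any run of characters that are not letters.
  tokens <- strsplit(tolower(text), "[^a-z]+")
  found <- vapply(tokens, function(tok) any(tok %in% dictionary), logical(1))
  ifelse(found, TRUE, NA)
}

df <- data.frame(
  x = 1:6,
  y = c("this is text", "word", "random", "Coca-cola",
        "this is meaningful asdfasdf", "sadfsdf"),
  stringsAsFactors = FALSE
)
df$z <- has_dictionary_word(df$y, toy_dictionary)
df$z
# TRUE TRUE TRUE TRUE TRUE NA
```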

Ben G answered Sep 18 '25 04:09