Numbers of columns of arguments do not match

I am using this example to conduct sentiment analysis of a collection of txt documents in R. The code is:

library(tm)
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(dplyr)
library(wordcloud)
require(reshape2)

files <- list.files(inputdir,pattern="*.txt")

GetNrcSentiment <- function(file){

    fileName <- glue(inputdir, file, sep = "")
    fileName <- trimws(fileName)
    fileText <- glue(read_file(fileName))
    fileText <- gsub("\\$", "", fileText) 

    tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)

    # get the sentiment from the first text: 
    sentiment <- tokens %>%
        inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
        count(sentiment) %>% # count the # of positive & negative words
        spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
        mutate(sentiment = positive - negative) %>% # positive - negative
        mutate(file = file) %>% # add the name of our file
        mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
        mutate(city = str_match(file, "(.*?).2")[2]) 

    return(sentiment)
}

The .txt files are stored in inputdir and have names AB-City.0000, where AB is an abbreviation of a country, City is a city name and 0000 is the year (ranging from 2000 to 2017).

The function works for a single file as expected, i.e. GetNrcSentiment(files[1]) gives me a tibble with proper counts per sentiment. However, when I try to run it for the whole set, i.e.

nrc_sentiments  <- data_frame()

for(i in files){
    nrc_sentiments <- rbind(nrc_sentiments, GetNrcSentiment(i))
}

I get the following error message:

Joining, by = "word"
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

The exact same code works well with longer documents, but gives an error when dealing with shorter texts. It seems that not all sentiments are found in small documents, so the number of columns varies from document to document, which might lead to this error, but I am not sure. I would appreciate any advice on how to fix the problem. If a sentiment is not found, I would want the entry to be equal to zero (if that is the cause of my problem).

As an aside, the bing sentiment function runs through about two dozen files and then gives a different error, which seems to point to the same problem (negative sentiment not found?):

GetBingSentiment <- function(file){
    fileName <- glue(inputdir, file, sep = "")
    fileName <- trimws(fileName)

    fileText <- glue(read_file(fileName))
    fileText <- gsub("\\$", "", fileText)       
    tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)

    # get the sentiment from the first text: 
    sentiment <- tokens %>%
        inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
        count(sentiment) %>% # count the # of positive & negative words
        spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
        mutate(sentiment = positive - negative) %>% 
        mutate(file = file) %>% # add the name of our file
        mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
        mutate(city = str_match(file, "(.*?).2")[2])

    # return our sentiment dataframe
    return(sentiment)
}

Error in mutate_impl(.data, dots) : 
  Evaluation error: object 'negative' not found. 

EDIT: Following the recommendation by David Klotz I edited the code to

for(i in files){ nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i)) } 

As a result, instead of throwing an error, the nrc function generates NA if words from a certain sentiment are not found. However, after 22 joins I get a different error:

Error in mutate_impl(.data, dots) : Evaluation error: object 'negative' not found.

The same error shows up when I run the bing function with dplyr. By the time the functions reach the 22nd document, both data frames contain columns for all sentiments. What may cause the error, and how can I diagnose it?
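A likely cause, offered as a sketch rather than a confirmed diagnosis: the error is raised inside a single call to the function, not by the binding step. If a file contains no word tagged negative, count() returns no negative row, spread() creates no negative column, and mutate(sentiment = positive - negative) fails no matter which columns the accumulated data frame already has. The example below uses invented counts and a hand-written vector of the NRC category names (both are assumptions for illustration). It joins the counts onto a template of all expected categories before spreading, so every column exists and missing ones are zero.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts for a short file that contains no negative words
counts <- tibble(sentiment = c("positive", "joy"), n = c(2L, 1L))

# Diagnosis: is the column that mutate() needs going to exist?
has_negative <- "negative" %in% counts$sentiment   # FALSE for this file

# Categories the later steps rely on (NRC labels; use
# c("negative", "positive") for the bing lexicon)
all_sentiments <- c("anger", "anticipation", "disgust", "fear", "joy",
                    "negative", "positive", "sadness", "surprise", "trust")

wide <- tibble(sentiment = all_sentiments) %>%
    left_join(counts, by = "sentiment") %>%   # keep every category
    mutate(n = coalesce(n, 0L)) %>%           # categories not found become 0
    spread(sentiment, n) %>%                  # one column per category
    mutate(sentiment = positive - negative)   # now always defined
```

Inside GetNrcSentiment, the same template join would sit between count(sentiment) and spread(sentiment, n, fill = 0). Every file then returns the same set of columns, which should also let plain rbind work.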

Michael asked Jun 12 '18 15:06



1 Answer

dplyr's bind_rows function is more flexible than rbind, at least when it comes to missing columns:

nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
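To illustrate the difference, here is a minimal sketch with two invented per-file results (the file names and counts are made up). It shows rbind failing on mismatched columns, bind_rows filling the gap with NA, and one way to turn that NA into the zero asked for in the question.

```r
library(dplyr)

# Toy per-file results; the values are invented for illustration
long_doc  <- tibble(file = "a.txt", negative = 3, positive = 5)
short_doc <- tibble(file = "b.txt", positive = 2)   # no negative words found

# rbind() refuses inputs whose columns differ
rbind_result <- tryCatch(rbind(long_doc, short_doc),
                         error = function(e) "rbind failed")

# bind_rows() matches columns by name and fills missing ones with NA
combined <- bind_rows(long_doc, short_doc)

# Replace the NA count with zero
combined <- combined %>% mutate(negative = coalesce(negative, 0))
```

Note that this only helps when combining results. A column that is missing inside a single file's result is still missing when that file is processed on its own.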
David Klotz answered Sep 29 '22 13:09
