
Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Tags:

python

gensim

I am getting started with Python and have run into a problem using gensim. I am trying to load files from disk and process them (split them into words and lowercase them with lower()).

The code I have is below:

import glob
import os

from gensim import corpora

dictionary_arr = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)

The list (dictionary_arr) contains all the words from all the files. I then pass that list to gensim's corpora.Dictionary, but I get an error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I can't figure out what the problem is. A little guidance would be appreciated.

Sam asked Oct 20 '15 06:10


1 Answer

Dictionary expects tokenized documents as input: a list in which each document is itself a list of string tokens, not a flat list of strings:

from gensim.corpora import Dictionary

dataset = ['driving car ',
           'drive car carefully',
           'student and university']

# Split each sentence into a list of tokens before passing it to Dictionary
dataset = [d.split() for d in dataset]

vocab = Dictionary(dataset)
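
Applied to the code in the question, the likely cause is that dictionary_arr is a flat list of words, so Dictionary treats every word string as if it were a whole document and raises the TypeError. A minimal sketch of one possible fix is to append one token list per file instead of one word at a time. The helper names load_tokenized_docs and build_dictionary, the sorted file order, and the UTF-8 encoding are assumptions made for illustration; they are not from the original post.

```python
import glob
import os


def load_tokenized_docs(path):
    """Return one list of lowercase tokens per .txt file found in `path`."""
    docs = []
    # sorted() only makes the document order reproducible; it is optional.
    for file_path in sorted(glob.glob(os.path.join(path, "*.txt"))):
        # Assumption: the files are UTF-8 text.
        with open(file_path, "r", encoding="utf-8") as f:
            # Append the whole token list, so each file becomes one document.
            docs.append(f.read().lower().split())
    return docs


def build_dictionary(path):
    """Build a gensim Dictionary from the tokenized files in `path`."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim import corpora

    return corpora.Dictionary(load_tokenized_docs(path))
```

Calling build_dictionary(path) with the same path variable as in the question should then give Dictionary the list-of-token-lists shape it expects.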
Amir answered Sep 22 '22 13:09