Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Question

I am getting started with Python and have run into a problem while using gensim. I am trying to load files from my disk and process them (split them into words and lowercase them).

The code I have is below:

dictionary_arr=[]
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open (file_path, "r") as myfile:
        text=myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)

The list (dictionary_arr) contains all the words across all the files. I then pass the list to gensim's corpora.Dictionary, but I get an error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I can't understand what the problem is. A little guidance would be appreciated.

Amir · Accepted Answer

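Dictionary needs tokenized documents as its input, that is, a list of lists of tokens. Each element of the input is treated as one document, so a flat list of words makes every word (a single string) a document, which triggers the error: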

from gensim.corpora import Dictionary

dataset = ['driving car ',
           'drive car carefully',
           'student and university']

# be sure to split each sentence into tokens before feeding it into Dictionary
dataset = [d.split() for d in dataset]

vocab = Dictionary(dataset)
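
Applied to the code in the question, the same idea might look like the sketch below. It keeps one token list per file instead of appending every word to a single flat list. The function names load_tokenized_documents and build_dictionary are made up for illustration, and the utf-8 encoding and the sorted file order are assumptions, not something the question specifies.

```python
import glob
import os


def load_tokenized_documents(path):
    """Return one list of lowercase tokens per .txt file found in `path`."""
    documents = []
    # sorted() only makes the file order predictable (an assumption, not required)
    for file_path in sorted(glob.glob(os.path.join(path, "*.txt"))):
        # utf-8 is assumed here; adjust the encoding to match your files
        with open(file_path, "r", encoding="utf-8") as myfile:
            text = myfile.read()
        # append the whole token list, so each file stays its own document
        documents.append(text.lower().split())
    return documents


def build_dictionary(path):
    """Build a gensim Dictionary from the tokenized files in `path`."""
    # imported here so the loader above can be used without gensim installed
    from gensim import corpora
    return corpora.Dictionary(load_tokenized_documents(path))
```

The only real change from the original loop is documents.append(text.lower().split()) in place of appending each word separately, so Dictionary receives a list of token lists rather than a list of strings.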

Tags: python, gensim

Sam