Why does gensim's simple_preprocess Python tokenizer seem to skip the "i" token?

Question

import gensim
list(gensim.utils.simple_preprocess("i you he she I it we you they", deacc=True))

gives this result:

['you', 'he', 'she', 'it', 'we', 'you', 'they']

Is it normal? Are there any words that it skips? Should I use another tokenizer?

BONUS QUESTION: What does the "deacc=True" parameter mean?

gojomo · Accepted Answer

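As @user2357112-supports-monica mentions in their comment, this is the designed behavior of simple_preprocess(): per its documentation, it discards any token shorter than min_len characters, and min_len defaults to 2. The input is also lowercased, so the capital "I" becomes "i" and is dropped for the same reason. The only other tokens it skips by length are those longer than max_len, which defaults to 15.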
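To make the length filter concrete, here is a minimal standard-library sketch of the same idea. It is a simplified stand-in, not gensim's source: the function name length_filtered_tokens and the regular expression are assumptions made for illustration, and only the min_len and max_len defaults are taken from the documentation.

```python
import re

# Simplified stand-in for the filtering step; NOT gensim's actual source.
# Matches runs of word characters that are not digits (letters and underscore).
TOKEN_RE = re.compile(r"[^\W\d]+")


def length_filtered_tokens(text, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens, and keep only
    tokens whose length lies between min_len and max_len inclusive."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


sentence = "i you he she I it we you they"

# With the default min_len=2, both "i" and the lowercased "I" are dropped.
default_tokens = length_filtered_tokens(sentence)
print(default_tokens)

# Lowering min_len to 1 keeps the one-character tokens.
all_tokens = length_filtered_tokens(sentence, min_len=1)
print(all_tokens)
```

With gensim itself, the equivalent change is to pass min_len=1 to simple_preprocess(), which should keep one-character tokens such as "i".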

Your "bonus question" is also answered in that same documentation:

deacc (bool, optional) – Remove accent marks from tokens using deaccent()?

(The deaccent() function is another utility function, documented at the link, which does exactly what the name and documentation suggest: removes accent marks from letters, so that, for example, 'é' becomes just 'e'.)
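As a rough illustration of what accent removal involves, the sketch below uses only the standard unicodedata module. The helper name strip_accents is hypothetical, and this is an approximation of the technique (decompose, drop combining marks, recompose), not a copy of gensim's deaccent().

```python
import unicodedata


def strip_accents(text):
    """Remove accent marks by decomposing characters, dropping the
    combining marks, and recomposing what is left."""
    # NFD splits 'é' into the base letter 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category 'Mn' (nonspacing mark) covers combining accents.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)


plain = strip_accents("café déjà vu")
print(plain)
```

Text without accents passes through unchanged, so for English-only input like the sentence in the question, deacc=True makes no visible difference.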

Tags: python, tokenize, nlp, gensim
