
Why does gensim's simple_preprocess Python tokenizer seem to skip the "i" token?

import gensim.utils
list(gensim.utils.simple_preprocess("i you he she I it we you they", deacc=True))

gives this result:

['you', 'he', 'she', 'it', 'we', 'you', 'they']

Is this normal? Are there other words that it skips? Should I use another tokenizer?

BONUS QUESTION: What does the "deacc=True" parameter mean?

9879ypxkj asked Dec 09 '25 21:12


1 Answer

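As @user2357112-supports-monica mentions in their comment, this is the designed behavior of simple_preprocess(): per its documentation, it discards any tokens shorter than min_len=2 characters. Both "i" and "I" are a single character, so they are dropped. Any other one-character token is skipped the same way, as is any token longer than max_len=15 characters.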
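To make the length filter concrete, here is a minimal standard-library sketch of the behavior described above. It is not gensim's actual code: sketch_preprocess and its regular expression are hypothetical stand-ins that only approximate how simple_preprocess() lowercases, tokenizes, and then filters by min_len and max_len.

```python
import re

# Hypothetical stand-in for gensim's tokenizer: runs of word characters
# that are not digits. This is an approximation, not gensim's own pattern.
_WORD = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

def sketch_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and keep tokens whose length is in [min_len, max_len]."""
    tokens = _WORD.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sentence = "i you he she I it we you they"

# Default min_len=2: the one-character tokens "i" and "I" are dropped.
default_tokens = sketch_preprocess(sentence)

# min_len=1: the one-character tokens survive (lowercased).
all_tokens = sketch_preprocess(sentence, min_len=1)

print(default_tokens)
print(all_tokens)
```

If the real function behaves like this sketch, passing min_len=1 to simple_preprocess() itself should keep the "i" tokens, so switching tokenizers is probably not needed just for this.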

Your "bonus question" is also answered in that same documentation:

  • deacc (bool, optional) – Remove accent marks from tokens using deaccent()?

(The deaccent() function is another utility function in gensim.utils, documented in the same place, which does exactly what the name and documentation suggest: it removes accent marks from letters, so that, for example, 'é' becomes just 'e'.)
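For illustration, accent removal of this kind can be sketched with Python's unicodedata module. The helper name strip_accents is hypothetical, and this only approximates what deaccent() does; it is not presented as gensim's implementation.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, so that 'é' becomes 'e'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn", nonspacing mark).
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever remains into the usual composed form.
    return unicodedata.normalize("NFC", kept)

print(strip_accents("café déjà vu"))
```

With deacc=True, simple_preprocess() applies this kind of normalization to the tokens, so "café" and "cafe" end up as the same token.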

gojomo answered Dec 12 '25 11:12


