list(gensim.utils.simple_preprocess("i you he she I it we you they", deacc=True))
gives as result:
['you', 'he', 'she', 'it', 'we', 'you', 'they']
Is it normal? Are there any words that it skips? Should I use another tokenizer?
BONUS QUESTION: What does the "deacc=True" parameter mean?
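As @user2357112-supports-monica mentions in their comment, this is the designed behavior of simple_preprocess(): per its documentation, it discards any token shorter than min_len characters (default 2) or longer than max_len characters (default 15). That is why the one-letter tokens 'i' and 'I' are missing from your result. If you need them, pass min_len=1.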
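To make the length filter concrete, here is a minimal standard-library sketch that imitates it. It is not gensim's actual code: the name simple_preprocess_sketch and the regular expression are assumptions made for illustration, and the sketch leaves out the deacc step.

```python
import re

# Assumed stand-in for gensim's tokenizer: runs of word characters, no digits.
TOKEN_RE = re.compile(r"(?:(?!\d)\w)+")


def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Hypothetical imitation of the length filter, not the library function."""
    # Lowercase first, then split into alphabetic tokens.
    tokens = TOKEN_RE.findall(text.lower())
    # Keep only tokens whose length falls inside [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]


text = "i you he she I it we you they"
print(simple_preprocess_sketch(text))             # one-letter tokens dropped
print(simple_preprocess_sketch(text, min_len=1))  # one-letter tokens kept
```

With gensim itself, the equivalent change would be gensim.utils.simple_preprocess(text, deacc=True, min_len=1).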
Your "bonus question" is also answered in that same documentation:
- deacc (bool, optional) – Remove accent marks from tokens using deaccent()?
(The deaccent() function is another utility function, documented at the link, which does exactly what the name and documentation suggest: removes accent marks from letters, so that, for example, 'é' becomes just 'e'.)
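For intuition, accent removal of this kind can be sketched with the standard library alone. The helper below, deaccent_sketch, is a hypothetical name and an approximation based on Unicode normalization. It is meant to show the idea, not to reproduce gensim's deaccent() exactly.

```python
import unicodedata


def deaccent_sketch(text):
    """Hypothetical helper: strip combining accent marks from text."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever remains into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


print(deaccent_sketch("café naïve"))
```

Note that letters without a decomposition, such as 'ø', pass through this sketch unchanged.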