 

Tokenizer for full-text


This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.

Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text search. Performance is very important; I will be parsing many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (AFAIK) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)
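For illustration, the naive baseline I have in mind looks something like the sketch below. It is only a minimal example under my own assumptions (ASCII input, a token being any maximal run of alphanumeric characters, case-folded to lower case); a real indexing tokenizer would add Unicode handling, stemming, stop-word removal, and so on.

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Naive heuristic: a token is a maximal run of alphanumeric
    // characters, folded to lower case. Everything else is a separator.
    std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> tokens;
        std::string current;
        for (unsigned char c : text) {
            if (std::isalnum(c)) {
                current.push_back(static_cast<char>(std::tolower(c)));
            } else if (!current.empty()) {
                tokens.push_back(current);  // token ends at a separator
                current.clear();
            }
        }
        if (!current.empty()) tokens.push_back(current);  // trailing token
        return tokens;
    }

    int main() {
        for (const auto& t : tokenize("Don't re-invent the wheel!"))
            std::cout << t << '\n';  // prints: don t re invent the wheel
        return 0;
    }

Even this toy example shows why heuristics matter: the apostrophe splits "Don't" into "don" and "t", which may or may not be what a search index wants.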

Any good pointers?

Thanks!