
Difference between tokenization and segmentation

What is the difference between tokenization and segmentation in NLP? I searched for both terms but couldn't find a clear distinction between them.

Mahmoud Noor asked May 29 '26 22:05



1 Answer

Short answer: All tokenization is segmentation, but not all segmentation is tokenization.

Long answer:
Segmentation is the more generic concept of splitting input text into parts. Tokenization is a specific type of segmentation, carried out according to well-defined criteria.
For example, in a hypothetical scenario where every input sentence is a compound sentence made of two sub-sentences, splitting each one into two independent sentences can be called segmentation (but not tokenization).
Tokenization is a form of segmentation performed on the basis of semantic criteria or a token dictionary (e.g. word or sub-word tokenization), mainly with the intention of assigning token ids to the resulting pieces for downstream processing.
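
To make the distinction concrete, here is a minimal Python sketch, not a production tokenizer. It assumes a naive regex-based sentence splitter as the "generic" segmentation step and a simple word-level tokenizer whose vocabulary is built on the fly. The function names (`segment_sentences`, `tokenize`, `build_vocab`, `encode`) and the `<unk>` entry are made up for illustration and do not come from any particular library.

```python
import re

def segment_sentences(text):
    """Generic segmentation: split text into sentences.

    Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Word-level tokenization: lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def build_vocab(token_lists):
    """Assign an integer id to each distinct token (hypothetical '<unk>' = 0)."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Map tokens to ids, falling back to the '<unk>' id for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

text = "I like tea. You like coffee."

# Step 1: segmentation only; the output is still plain text pieces.
sentences = segment_sentences(text)

# Step 2: tokenization; the pieces are units that can be mapped to ids.
token_lists = [tokenize(s) for s in sentences]
vocab = build_vocab(token_lists)
ids = [encode(tokens, vocab) for tokens in token_lists]

print(sentences)    # ['I like tea.', 'You like coffee.']
print(token_lists)  # [['i', 'like', 'tea', '.'], ['you', 'like', 'coffee', '.']]
print(ids)          # [[1, 2, 3, 4], [5, 2, 6, 4]]
```

Both steps split the text, so both are segmentation. Only the second step produces units drawn from a fixed vocabulary with ids attached, which is what makes it tokenization in the sense described above.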

jdsurya answered Jun 01 '26 03:06
