I am trying to tokenize a string. I have a table of available tokens ordered in the form of a trie. Each token knows it has children. A simple tokens table will look like,
pattern value has_children
-------- ------ --------
s s-val 1
stack stack-val 0
over over-val 1
overflow overflow-val 0
In this table, stack
is a child of s
and overflow
is a child of over
. In practice, this table will have 5000+ records ordered in this way.
Now, given a string stackover
, it should output stack-valover-val
. Algorithm is greedy and it will try to find the longest match always.
To do this, I will start reading each character from the input, look for match, if a match found and the token has children, look for match again by including next character. Do this until we find the longest match. If no match found, try to match by including the next character until we reach the end of string or a successful match.
If we reached end of the string without a match, output ?
symbol and remove the first character from the input. Repeat the whole process with remaining characters.
This algorithm works, but the backtracking and iterating on all possible combinations of the input makes it slow and complex.
I am wondering is there a better way of solving this? Any help would be appreciated.
The C function strtok() is a string tokenization function that takes two arguments: an initial string to be parsed and a const -qualified character delimiter. It returns a pointer to the first character of a token or to a null pointer if there is no token.
The function strtok_r() returns a pointer to the first character of the first token, writes a NULL character into s immediately following the returned token, and updates the pointer to which lasts points.
Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.
Instead of backtracking you could keep in memory all possible results, until one result singles out at certain point in input stream. Example
Tokens: S STACK STACKOVERFLOW STAG OVER OVERFLOW
String: SSTACKOVERFUN
1 - Found S on place 0, have tokens that begin with S, try them all, only S is valid, so resolve S
2 - S on 1, have such tokens, try them, possible valid are S and STACK. Don't resolve, just keep them in mind.
3 - T on 2, have no such tokens, so S could be resolved now, but we also have longer token (STACK) so S is no good. Ditch S, and STACK is only left, but it has children. Try string for children. There are no possible children so resolve STACK
4 - O on 6, have such tokens, try them, have only OVER, so resolve OVER
5 - F on 10, no such tokens, and nothing to resolve from before so this is non-tokenizable
6 and 7 - same as step 5
Final result: S STACK OVER fun
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With