
Efficient way to tokenize a string - C

Tags: c, algorithm

I am trying to tokenize a string. I have a table of available tokens organized as a trie. Each token records whether it has children. A simple token table looks like this:

pattern    value         has_children
--------   ------        --------
s          s-val         1
stack      stack-val     0
over       over-val      1
overflow   overflow-val  0

In this table, stack is a child of s and overflow is a child of over. In practice, this table will have 5000+ records ordered in this way.

Now, given the string stackover, it should output stack-valover-val. The algorithm is greedy: it always tries to find the longest match.

To do this, I read the input one character at a time and look for a match. If a match is found and the token has children, I look for a longer match by including the next character, and repeat until the longest match is found. If no match is found, I try again with the next character included, until there is a successful match or the end of the string is reached.

If the end of the string is reached without a match, I output a ? symbol and remove the first character from the input, then repeat the whole process with the remaining characters.

This algorithm works, but the backtracking and the iteration over all possible combinations of the input make it slow and complex.

Is there a better way of solving this? Any help would be appreciated.

Navaneeth K N asked Oct 31 '10 05:10


1 Answer

Instead of backtracking, you could keep all possible results in memory until a single result stands out at some point in the input stream. For example:

Tokens: S STACK STACKOVERFLOW STAG OVER OVERFLOW
String: SSTACKOVERFUN

1 - Found S at position 0. There are tokens that begin with S, so try them all. Only S is valid, so resolve S.
2 - S at position 1. There are tokens that begin with S, so try them. The possible valid tokens are S and STACK. Don't resolve yet; just keep both in mind.
3 - T at position 2. No token begins with T, so S could be resolved now, but a longer candidate (STACK) is still pending, so S is no good. Ditch S; only STACK is left, but it has children. Try the rest of the string against its children. None of them match, so resolve STACK.
4 - O at position 6. There are tokens that begin with O, so try them. Only OVER is valid, so resolve OVER.
5 - F at position 10. No token begins with F and nothing is pending from before, so this character is non-tokenizable.
6 and 7 - Same as step 5, for U and N.

Final result: S STACK OVER fun
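
The walkthrough above can be sketched in C as a single forward walk down the trie that remembers the last complete token seen, so the pending candidates never need to be stored separately. This is only a minimal sketch under some assumptions: the names Node, trie_insert, longest_match and tokenize are hypothetical, each node uses a 256-entry child array (simple, but memory-hungry), and each non-tokenizable character is written as ? as the question describes, rather than kept as text.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical trie node: one child slot per possible byte value. */
typedef struct Node {
    struct Node *child[256];
    const char *value;          /* non-NULL when a token ends at this node */
} Node;

Node *node_new(void) {
    return calloc(1, sizeof(Node));   /* all children NULL, value NULL */
}

/* Insert pattern -> value. The caller keeps `value` alive.
   Returns 0 on success, -1 if an allocation fails. */
int trie_insert(Node *root, const char *pattern, const char *value) {
    Node *n = root;
    for (const unsigned char *p = (const unsigned char *)pattern; *p; p++) {
        if (!n->child[*p]) {
            n->child[*p] = node_new();
            if (!n->child[*p]) return -1;
        }
        n = n->child[*p];
    }
    n->value = value;
    return 0;
}

/* Walk the trie from s and remember the last node that ended a token.
   Returns the length of the longest token (0 if none) and sets *value. */
size_t longest_match(const Node *root, const char *s, const char **value) {
    const Node *n = root;
    size_t best = 0;
    *value = NULL;
    for (size_t i = 0; s[i]; i++) {
        n = n->child[(unsigned char)s[i]];
        if (!n) break;                       /* no token continues this way */
        if (n->value) { best = i + 1; *value = n->value; }
    }
    return best;
}

/* Append the value of each longest match to out; an unmatched character
   becomes "?". Stops early if out (capacity cap) would overflow. */
void tokenize(const Node *root, const char *input, char *out, size_t cap) {
    size_t used = 0;
    if (cap == 0) return;
    out[0] = '\0';
    while (*input) {
        const char *value;
        size_t len = longest_match(root, input, &value);
        if (len == 0) { value = "?"; len = 1; }   /* skip one character */
        size_t vlen = strlen(value);
        if (used + vlen + 1 > cap) break;         /* keep out terminated */
        memcpy(out + used, value, vlen + 1);
        used += vlen;
        input += len;
    }
}

/* Free a trie built with node_new/trie_insert. */
void trie_free(Node *n) {
    if (!n) return;
    for (int i = 0; i < 256; i++) trie_free(n->child[i]);
    free(n);
}
```

With the four-row table from the question loaded through trie_insert, tokenize on stackover produces stack-valover-val. Each call to longest_match reads at most as many characters as the longest token, so the input is never re-scanned from scratch over all combinations.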

Dialecticus answered Sep 27 '22 19:09