Efficient way to tokenize a string - C

Tags:

algorithm

I am trying to tokenize a string. I have a table of available tokens organized as a trie. Each token knows whether it has children. A simple token table looks like this:

pattern    value         has_children
--------   ------        --------
s          s-val         1
stack      stack-val     0
over       over-val      1
overflow   overflow-val  0

In this table, stack is a child of s and overflow is a child of over. In practice, this table will have 5000+ records ordered in this way.

Now, given the string stackover, it should output stack-valover-val. The algorithm is greedy and always tries to find the longest match.

To do this, I read the input one character at a time and look for a match. If a match is found and the token has children, I look for a match again with the next character included, and repeat until I have the longest match. If no match is found, I keep including the next character until there is a successful match or the end of the string is reached.

If the end of the string is reached without a match, I output a ? symbol and remove the first character from the input, then repeat the whole process with the remaining characters.
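A minimal sketch of the process described above, for illustration only. It assumes a flat table with linear lookup; the names struct token, TABLE, lookup and tokenize_naive are hypothetical and not taken from the original code. The inner loop shows where the cost comes from: every start position may be extended all the way to the end of the input, with one table lookup per length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical row layout mirroring the table in the question. */
struct token {
    const char *pattern;
    const char *value;
    int has_children;
};

static const struct token TABLE[] = {
    { "s",        "s-val",        1 },
    { "stack",    "stack-val",    0 },
    { "over",     "over-val",     1 },
    { "overflow", "overflow-val", 0 },
};
#define TABLE_LEN (sizeof TABLE / sizeof TABLE[0])

/* Linear lookup of an exact pattern of length len; NULL if absent. */
static const struct token *lookup(const char *s, size_t len)
{
    for (size_t i = 0; i < TABLE_LEN; i++)
        if (strlen(TABLE[i].pattern) == len &&
            memcmp(TABLE[i].pattern, s, len) == 0)
            return &TABLE[i];
    return NULL;
}

/* Writes the translated string to out (capacity cap, NUL-terminated).
 * Returns 0 on success, -1 if out is too small. */
int tokenize_naive(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL && cap > 0);
    size_t n = strlen(in), pos = 0, used = 0;
    out[0] = '\0';
    while (pos < n) {
        const struct token *best = NULL;
        size_t best_len = 0;
        /* Grow the candidate one character at a time. */
        for (size_t len = 1; pos + len <= n; len++) {
            const struct token *t = lookup(in + pos, len);
            if (t == NULL)
                continue;          /* no match yet: include the next char */
            best = t;
            best_len = len;
            if (!t->has_children)
                break;             /* no longer token has this prefix */
        }
        /* No match at this position: emit '?' and drop one character. */
        const char *piece = best ? best->value : "?";
        size_t plen = strlen(piece);
        if (used + plen + 1 > cap)
            return -1;
        memcpy(out + used, piece, plen + 1);
        used += plen;
        pos += best ? best_len : 1;
    }
    return 0;
}
```

Calling tokenize_naive("stackover", buf, sizeof buf) with a sufficiently large buf leaves stack-valover-val in buf.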

This algorithm works, but the backtracking and the iteration over all possible combinations of the input make it slow and complex.

Is there a better way of solving this? Any help would be appreciated.

asked Oct 31 '10 05:10

Navaneeth K N

1 Answer

Instead of backtracking, you could keep all possible results in memory until a single one stands out at some point in the input stream. For example:

Tokens: S STACK STACKOVERFLOW STAG OVER OVERFLOW
String: SSTACKOVERFUN

1 - S at position 0: there are tokens that begin with S, so try them all. Only S is valid here, so resolve S.
2 - S at position 1: there are such tokens, so try them. The possible valid ones are S and STACK. Don't resolve yet, just keep both in mind.
3 - T at position 2: no token begins with T, so S could be resolved now, but a longer candidate (STACK) is still possible, so S is no good. Ditch S. Only STACK is left, but it has children, so try the input against them. No child matches (STACKOVERFLOW fails at U), so resolve STACK.
4 - O at position 6: there are such tokens, so try them. Only OVER matches (OVERFLOW fails at U), so resolve OVER.
5 - F at position 10: no token begins with F and nothing is pending from before, so this character is non-tokenizable.
6 and 7 - U at position 11 and N at position 12: same as step 5.

Final result: S STACK OVER fun
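One way to get a similar effect in C is to store the tokens in a character trie and walk it once per start position, remembering the longest token seen so far. This is a sketch of a related technique, not the exact bookkeeping from the steps above: all candidates share one path in the trie, so "keeping them in mind" reduces to remembering the last node where a token ended. It still re-reads the characters scanned past the last match. The names trie_add and tokenize_trie, the fixed node pool, and the restriction to ASCII 'A'..'Z' are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET  26   /* assumption: patterns use only ASCII 'A'..'Z' */
#define MAX_NODES 256  /* assumption: enough nodes for a small demo */

struct node {
    int child[ALPHABET]; /* index into pool; 0 means "no child" */
    const char *value;   /* non-NULL if a token ends at this node */
};

static struct node pool[MAX_NODES]; /* zero-initialized; node 0 is the root */
static int pool_used = 1;

/* Adds pattern to the trie. Returns 0 on success, -1 on an
 * unsupported character or a full pool. */
int trie_add(const char *pattern, const char *value)
{
    int cur = 0;
    for (const char *p = pattern; *p; p++) {
        if (*p < 'A' || *p > 'Z')
            return -1;
        int c = *p - 'A';
        if (pool[cur].child[c] == 0) {
            if (pool_used == MAX_NODES)
                return -1;
            pool[cur].child[c] = pool_used++;
        }
        cur = pool[cur].child[c];
    }
    pool[cur].value = value;
    return 0;
}

/* Greedy longest-match tokenizer. Writes token values (or '?' for a
 * non-tokenizable character) to out. Returns 0, or -1 if out is too small. */
int tokenize_trie(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL && cap > 0);
    size_t used = 0;
    out[0] = '\0';
    while (*in) {
        const char *best = NULL; /* value of the longest token seen so far */
        size_t best_len = 0;
        int cur = 0;
        /* Follow the single trie path spelled by the input. */
        for (size_t i = 0; in[i] >= 'A' && in[i] <= 'Z'; i++) {
            cur = pool[cur].child[in[i] - 'A'];
            if (cur == 0)
                break;               /* no token continues with this char */
            if (pool[cur].value) {   /* a token ends here: remember it */
                best = pool[cur].value;
                best_len = i + 1;
            }
        }
        const char *piece = best ? best : "?";
        size_t plen = strlen(piece);
        if (used + plen + 1 > cap)
            return -1;
        memcpy(out + used, piece, plen + 1);
        used += plen;
        in += best ? best_len : 1;
    }
    return 0;
}
```

With the six example tokens added so that each value is the token followed by a space, for instance trie_add("STACK", "STACK "), tokenizing SSTACKOVERFUN produces "S STACK OVER ???". Each step does one array lookup per character instead of a table search per candidate length.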

answered Sep 27 '22 19:09

Dialecticus
