Removing comments with a sliding window without nested while loops

I'm trying to remove comments and strings from a C source file, using C. I'll stick to comments for the examples. I read the input through a sliding window, so at any given moment I only have the current character n and the previous character n-1. I'm looking for an algorithm that avoids nested while loops if possible, although I will need one while loop to getchar through the input. My first thought was to loop until n is * and n-1 is /, then loop again until n is / and n-1 is *, but that nests one while inside another, which I feel is inefficient. I can do it this way if I have to, but I was wondering whether anyone has a better solution.

Mike Weber asked Apr 18 '13 15:04

2 Answers

The algorithm written with one while loop could look like this:

int c; // int rather than char, so that EOF can be told apart from real bytes
while ((c = getchar()) != EOF)
{
    ... // look at the byte that was just read

    if (...) // the byte is not inside a comment
        putchar(c);
}

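To decide whether the input character belongs to a comment, you can use a state machine. The following example has 4 states (0: ordinary code, 1: a slash was just seen, 2: inside a comment, 3: inside a comment and a star was just seen), along with the rules for moving to the next state.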

int c; // int rather than char, so that EOF can be told apart from real bytes
int state = 0;
int next_state;
while ((c = getchar()) != EOF)
{
    switch (state)
    {
        case 0: next_state = (c == '/' ? 1 : 0); break;
        case 1: next_state = (c == '*' ? 2 : c == '/' ? 1 : 0); break;
        case 2: next_state = (c == '*' ? 3 : 2); break;
        case 3: next_state = (c == '/' ? 0 : c == '*' ? 3 : 2); break;
        default: next_state = state; // will never happen
    }

    if (state == 1 && next_state != 2)
        putchar('/'); // for correct output when a slash is not followed by a star
    if (state <= 1 && next_state == 0)
        putchar(c); // ordinary character outside any comment
    state = next_state;
}
if (state == 1)
    putchar('/'); // the input ended right after a slash

The example above is very simple: it mishandles /* when it appears outside a comment context, such as inside a C string literal, and it doesn't support // comments, etc.
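As a rough illustration of how those gaps could be closed, here is a minimal sketch of a larger state machine that also recognizes // comments and skips over string and character literals. It still uses a single loop and only looks at the current character plus the state. The function name strip_comments, the state names, and the choice to read from a string instead of getchar are all assumptions made for the example. It also replaces each /* */ comment with one space, which is how the C standard treats comments, and it ignores backslash-newline splices and trigraphs.

```c
#include <stddef.h>

/* Hypothetical state names, chosen for this sketch only. */
enum state { CODE, SLASH, BLOCK, BLOCK_STAR, LINE, STR, STR_ESC, CHR, CHR_ESC };

/* Copies src to dst without comments and returns the number of bytes written.
   dst must have room for strlen(src) + 1 bytes. */
size_t strip_comments(const char *src, char *dst)
{
    enum state s = CODE;
    size_t n = 0;

    for (; *src != '\0'; src++) {       /* the only loop: one character per step */
        char c = *src;
        switch (s) {
        case CODE:
            if (c == '/') { s = SLASH; break; }     /* hold the slash back */
            if (c == '"') s = STR;
            else if (c == '\'') s = CHR;
            dst[n++] = c;
            break;
        case SLASH:
            if (c == '*') { s = BLOCK; break; }
            if (c == '/') { s = LINE; break; }
            dst[n++] = '/';                         /* false start: emit the held slash */
            s = (c == '"') ? STR : (c == '\'') ? CHR : CODE;
            dst[n++] = c;
            break;
        case BLOCK:
            if (c == '*') s = BLOCK_STAR;
            break;
        case BLOCK_STAR:
            if (c == '/') { s = CODE; dst[n++] = ' '; }  /* comment becomes one space */
            else if (c != '*') s = BLOCK;
            break;
        case LINE:
            if (c == '\n') { s = CODE; dst[n++] = c; }   /* keep the newline itself */
            break;
        case STR:
            dst[n++] = c;
            if (c == '\\') s = STR_ESC;
            else if (c == '"') s = CODE;
            break;
        case STR_ESC:                               /* escaped character, copied as-is */
            dst[n++] = c;
            s = STR;
            break;
        case CHR:
            dst[n++] = c;
            if (c == '\\') s = CHR_ESC;
            else if (c == '\'') s = CODE;
            break;
        case CHR_ESC:
            dst[n++] = c;
            s = CHR;
            break;
        }
    }
    if (s == SLASH) dst[n++] = '/';     /* flush a slash held at end of input */
    dst[n] = '\0';
    return n;
}
```

To use it on a stream, the for loop could be swapped for while ((c = getchar()) != EOF) and each dst[n++] = x for putchar(x). Dropping strings as well, as the question asks, would mean not copying characters in the STR and STR_ESC states.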

anatolyg answered Oct 20 '22 01:10


Doing this correctly is more complicated than one may at first think, as ably pointed out by the other comments here. I would strongly recommend writing a table-driven FSM, using a state transition diagram to get the transitions right. Trying to do anything more than a few states with case statements is horribly error-prone IMO.
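To give an idea of what "table-driven" means here, the following is a minimal sketch for the small case of /* */ comments only. Every transition lives in one array indexed by state and input class, and the loop body never changes when states are added. The names (strip_block_comments, struct cell, the S_, C_ and A_ constants) are made up for this example, and it works on a string rather than on getchar so it is easy to try out.

```c
#include <stddef.h>

enum { S_CODE, S_SLASH, S_CMT, S_CMT_STAR, N_STATES };  /* states */
enum { C_SLASH, C_STAR, C_OTHER, N_CLASSES };           /* input classes */
enum { A_NONE, A_EMIT, A_HELD, A_BOTH };                /* output actions */

struct cell { unsigned char next, action; };

/* A_EMIT prints the current character, A_HELD prints the slash that was
   held back, A_BOTH prints the held slash and then the current character. */
static const struct cell table[N_STATES][N_CLASSES] = {
    /*                 '/'                 '*'                   other           */
    [S_CODE]     = { {S_SLASH, A_NONE}, {S_CODE, A_EMIT},     {S_CODE, A_EMIT} },
    [S_SLASH]    = { {S_SLASH, A_HELD}, {S_CMT, A_NONE},      {S_CODE, A_BOTH} },
    [S_CMT]      = { {S_CMT, A_NONE},   {S_CMT_STAR, A_NONE}, {S_CMT, A_NONE}  },
    [S_CMT_STAR] = { {S_CODE, A_NONE},  {S_CMT_STAR, A_NONE}, {S_CMT, A_NONE}  },
};

/* Copies src to dst without block comments; dst needs strlen(src) + 1 bytes.
   Returns the number of bytes written. */
size_t strip_block_comments(const char *src, char *dst)
{
    int state = S_CODE;
    size_t n = 0;

    for (; *src != '\0'; src++) {
        int cls = (*src == '/') ? C_SLASH : (*src == '*') ? C_STAR : C_OTHER;
        struct cell t = table[state][cls];

        if (t.action == A_HELD || t.action == A_BOTH)
            dst[n++] = '/';
        if (t.action == A_EMIT || t.action == A_BOTH)
            dst[n++] = *src;
        state = t.next;
    }
    if (state == S_SLASH)
        dst[n++] = '/';     /* flush a slash held at end of input */
    dst[n] = '\0';
    return n;
}
```

Handling strings, character constants and // comments then comes down to adding rows and input classes to the table, which is easier to check against a transition diagram than a growing switch statement.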

Here's a diagram in dot/graphviz format from which you could probably directly code a state table. Note that I haven't tested this at all, so YMMV.

The semantics of the diagram are that <ch> is the fall-through: it is taken when none of the other inputs listed for that state match. End of file is an error in any state except S0, and so is any character that is not explicitly listed in a state that has no <ch> fall-through. Every character scanned is printed, except while in a comment (S4 and S5) and while detecting the start of a comment (S1). You will have to buffer characters while detecting the start of a comment, then print them if it's a false start, or throw them away once you are sure it really is a comment.

In the dot diagram, sq is a single quote ', dq is a double quote ".

digraph state_machine {

    node [shape=doublecircle]; S0 /* init */;
    node [shape=circle];

    S0  /* init */      -> S1  /* begin_cmt */ [label = "'/'"];
    S0  /* init */      -> S2  /* in_str */    [label = dq];
    S0  /* init */      -> S3  /* in_ch */     [label = sq];
    S0  /* init */      -> S0  /* init */      [label = "<ch>"];
    S1  /* begin_cmt */ -> S4  /* in_slc */    [label = "'/'"];
    S1  /* begin_cmt */ -> S5  /* in_mlc */    [label = "'*'"];
    S1  /* begin_cmt */ -> S0  /* init */      [label = "<ch>"];
    S1  /* begin_cmt */ -> S1  /* begin_cmt */ [label = "'\\n'"]; // handle "/\n/" and "/\n*"
    S2  /* in_str */    -> S0  /* init */      [label = dq];
    S2  /* in_str */    -> S6  /* str_esc */   [label = "'\\'"];
    S2  /* in_str */    -> S2  /* in_str */    [label = "<ch>"];
    S3  /* in_ch */     -> S0  /* init */      [label = sq];
    S4  /* in_slc */    -> S4  /* in_slc */    [label = "<ch>"];
    S4  /* in_slc */    -> S0  /* init */      [label = "'\\n'"];
    S5  /* in_mlc */    -> S7  /* end_mlc */   [label = "'*'"];
    S5  /* in_mlc */    -> S5  /* in_mlc */    [label = "<ch>"];
    S7  /* end_mlc */   -> S7  /* end_mlc */   [label = "'*'|'\\n'"];
    S7  /* end_mlc */   -> S0  /* init */      [label = "'/'"];
    S7  /* end_mlc */   -> S5  /* in_mlc */    [label = "<ch>"];
    S6  /* str_esc */   -> S8  /* oct */       [label = "[0-3]"];
    S6  /* str_esc */   -> S9  /* hex */       [label = "'x'"];
    S6  /* str_esc */   -> S2  /* in_str */    [label = "<ch>"];
    S8  /* oct */       -> S10 /* o1 */        [label = "[0-7]"];
    S10 /* o1 */        -> S2  /* in_str */    [label = "[0-7]"];
    S9  /* hex */       -> S11 /* h1 */        [label = hex];
    S11 /* h1 */        -> S2  /* in_str */    [label = hex];
    S3  /* in_ch */     -> S12 /* ch_esc */    [label = "'\\'"];
    S3  /* in_ch */     -> S13 /* out_ch */    [label = "<ch>"];
    S13 /* out_ch */    -> S0  /* init */      [label = sq];
    S12 /* ch_esc */    -> S3  /* in_ch */     [label = sq];
    S12 /* ch_esc */    -> S12 /* ch_esc */    [label = "<ch>"];
}

EdF answered Oct 20 '22 01:10