Perl 6: How to get all lines that are not indented by a given width of spaces?

I am working on a very large text file whose lines have indentations of various sizes. The acceptable lines are indented to a width of 12 characters, produced by some combination of tabs and spaces. I want to get all the lines that do not have a 12-character-wide indentation; those lines are indented anywhere from 0 to 11 characters wide, again by combinations of tabs and space characters. Here is what I have so far:

if $badLine !~~ m/ ^^ [\s ** 12 ||
                      \t \s ** 4 ||
                      \s \t \s ** 3 ] / { say $badLine; }

But the problem is that when you work on a text file with a word processor, pressing the tab key can fill anywhere from 1 to 8 space characters' worth of width, depending on how far away the next tab stop is. What would be a smart way to get all the unacceptable lines, i.e. those that do not have a 12-character-wide indentation?

Thanks.

lisprogtor asked Dec 23 '22 21:12



1 Answer

Width 12

For an indentation width of 12, assuming that tab stops are at positions 0, 8, 16, etc.:

for $input.lines {
    .say if not /
        ^                             # start of line
        [" " ** 8 || " " ** 0..7 \t]  # whitespace up to first tab stop
        [" " ** 4]                    # whitespace up to position 12
        [\S | $]                      # non-space character or end of line
    /;
}

Explanation:

  1. To get from the start of the line (position 0) to the first tab stop (position 8), there are two possibilities we need to match:

    • 8 spaces.
    • 0 to 7 spaces, followed by 1 tab. (The tab jumps straight to the tab stop, so it fills out whatever width remains after the spaces.)
  2. The only way to get from the tab stop (position 8) to the indentation target (position 12) is 4 spaces. (A tab would jump past the target to the next tab stop at position 16.)

  3. Anchoring to the start of the line, and to whatever comes after the indentation, is important so that we don't accidentally match part of a longer indentation.
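
As an independent cross-check of the regex above, the same rule can be stated procedurally: walk the leading whitespace, advance one column per space, and jump to the next tab stop per tab. The following is a minimal sketch in Python rather than Perl 6, assuming tab stops every 8 columns; `TAB_STOP`, `indent_width` and `bad_lines` are illustrative names, not part of the code above.

```python
TAB_STOP = 8  # assumption: tab stops at columns 0, 8, 16, ...

def indent_width(line: str, tab_stop: int = TAB_STOP) -> int:
    """Return the column reached by the leading spaces and tabs of `line`."""
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            # A tab jumps to the next tab stop, whatever the current column is.
            col += tab_stop - col % tab_stop
        else:
            break  # first non-indentation character ends the indentation
    return col

def bad_lines(lines, width: int = 12):
    """Yield the lines whose indentation does not end exactly at `width`."""
    for line in lines:
        if indent_width(line) != width:
            yield line
```

For instance, `indent_width("   \t    x")` walks 3 spaces (column 3), one tab (column 8) and 4 spaces (column 12), so that line would count as acceptable.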

Arbitrary width

The indentation matching can be factored out into a parameterized named token that can handle arbitrary widths:

my token indent ($width) {
    [" " ** 8 || " " ** 0..7 \t] ** {$width div 8}
     " " ** {$width % 8}
}

.say if not /^ <indent(12)> [\S | $]/ for $input.lines;

Explanation:

  1. The same expression as above is used to get to the first tab stop, but now it is repeated as many times as necessary to get to the last tab stop before the target. ($width div 8 times in total, where div is the integer division operator.)

  2. Whatever distance is left between the last tab stop and the target must be filled with spaces. ($width % 8 spaces, where % is the modulo operator.)
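
The div/modulo decomposition can also be tried out outside Perl 6. The sketch below builds an analogous pattern with Python's `re` module (a different regex engine, used here only for illustration), again assuming tab stops every 8 columns; `indent_pattern` is a hypothetical helper name.

```python
import re

def indent_pattern(width: int, tab_stop: int = 8) -> re.Pattern:
    """Build a pattern matching an indentation of exactly `width` columns."""
    # width div tab_stop full tab stops, then width % tab_stop plain spaces.
    full_stops, remainder = divmod(width, tab_stop)
    # One step to the next tab stop: 8 spaces, or 0..7 spaces plus a tab.
    to_next_stop = rf"(?: {{{tab_stop}}}| {{0,{tab_stop - 1}}}\t)"
    # The lookahead plays the role of [\S | $]: the indentation must end here.
    return re.compile(
        rf"^{to_next_stop}{{{full_stops}}} {{{remainder}}}(?=\S|$)"
    )

twelve = indent_pattern(12)
for line in ["\t    ok", "  \t    ok", "\t\ttoo deep", "   too shallow"]:
    if not twelve.match(line):
        print(repr(line))
```

For a width of 12 the generated pattern is `^(?: {8}| {0,7}\t){1} {4}(?=\S|$)`, which mirrors the hand-written regex from the first section.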

Arbitrary position and width

The token in the above example assumes that it starts matching at a tab stop position (such as the start of the line). It can be further generalized to match a given width of tabs and spaces, no matter where in the line you call it:

my token indent ($width) {
    :my ($before-first-stop, $number-of-stops, $after-last-stop);
    {
        # distance from the current position to the next tab stop, capped at $width
        $before-first-stop = min $width, 8 - $/.from % 8;
        # full tab stops that fit into the remaining width
        $number-of-stops   = ($width - $before-first-stop) div 8;
        # columns left over after the last tab stop
        $after-last-stop   = ($width - $before-first-stop) % 8;
    }
    [" " ** {$before-first-stop} || " " ** {^$before-first-stop} \t]
    [" " ** 8 || " " ** 0..7 \t] ** {$number-of-stops}
     " " ** {$after-last-stop}
}

Explanation:

  1. Same principle as before, except that now we first need to match as many spaces as necessary to get from the current position in the string to the first tab stop that follows it.

  2. The current position in the string is given by $/.from; the rest is simple arithmetic.

  3. A few lexical variables (with hopefully descriptive names) are declared and used inside the token, to make the code easier to follow.
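
To see the arithmetic in isolation, here is a small Python sketch (not Perl 6) that computes the same three quantities for a given starting position and width. It assumes tab stops every 8 columns; `split_indent` is a hypothetical name, and `pos` stands in for `$/.from`.

```python
def split_indent(pos: int, width: int, tab_stop: int = 8):
    """Split `width` columns starting at column `pos` into three segments.

    Returns (before_first_stop, number_of_stops, after_last_stop), following
    the same arithmetic as the lexical variables in the token.
    """
    # Columns up to the next tab stop, but never more than the requested width.
    before_first_stop = min(width, tab_stop - pos % tab_stop)
    remaining = width - before_first_stop
    # Full tab-stop intervals, then the leftover columns after the last stop.
    number_of_stops, after_last_stop = divmod(remaining, tab_stop)
    return before_first_stop, number_of_stops, after_last_stop

print(split_indent(0, 12))  # (8, 0, 4)
print(split_indent(5, 12))  # (3, 1, 1)
```

Starting at column 5, a width of 12 therefore splits into 3 columns up to the tab stop at 8, one full stop up to 16, and 1 trailing space. One case that may deserve a closer look: when `width` is smaller than the distance to the next tab stop (for example `split_indent(6, 1)`), no tab stop is reached at all, so a tab would overshoot the target and presumably only plain spaces should be accepted there.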

smls answered May 10 '23 12:05
