Regex quirk in tcl

Tags:

regex

tcl

This question is about understanding the behaviour of a specific regex in Tcl 8.5 as built into Vivado. In particular, when or-ing together two regex parts I get unexpected results:

I was working on indenting a block of text for the command line using regular expressions. My first thought was to replace every newline with a newline followed by some spaces (shown as X here for clarity), so:

puts [regsub -all "\n" "foo\nBar\nBaz" "\nXX"]
foo
XXBar
XXBaz

This does not indent the first line. To match the first line I use ^:

puts [regsub -all "^" "foo\nBar\nBaz" "\nXX"]

XXfoo
Bar
Baz

Now it should just be a matter of combining the two regex parts with |. However, I get output I cannot explain:

puts [regsub -all "^|\n" "foo\nBar\nBaz" "\nXX"]

XXfoo
XX
XXBar
XX
XXBaz


Where do the additional newlines and indentation marks (X) come from? Why does it look like I get two substitutions? Is this a bug, or is there something I do not understand about regular expression syntax?

For completeness' sake, here is the regex I use now: puts [regsub -all -line "^" "foo\nBar\nBaz" "XX"]


asked Dec 27 '17 16:12

ted

1 Answer

Basic versus Extended regular expressions

I think the explanation hinges on the fact that the expression ^ is treated as a basic regular expression (BRE), but when you add | it is treated like an advanced regular expression (ARE), which is a superset of extended regular expressions (ERE). This is based on the following, from the re_syntax man page:

An ARE is one or more branches, separated by “|”, matching anything that matches any of the branches.

The second part of the puzzle is that ^ is treated differently in basic and extended/advanced regular expressions. In a basic regular expression, ^ only has a special meaning when it is the first character of the expression. Again, from the re_syntax man page:

BREs differ from EREs in several respects ... ^ is an ordinary character except at the beginning of the RE or the beginning of a parenthesized subexpression,...

In other words, for a BRE, ^ will only match the very start of the string, but in an ARE it will match the beginning of a line.

So, what exactly is happening?

First, ^ matches the beginning of the string, so that position is replaced with the replacement \nXX. Next, the matcher sees f, then o, then o, none of which match. Then it sees \n, which matches, so it is replaced with the replacement as well.

At this point the matcher has consumed the characters foo\n. What remains is Bar\nBaz. The matcher now looks at that string, and the pattern ^ matches, so it again replaces it with the replacement. Thus, you end up with two copies of the replacement string, one for the newline and one for the beginning of the string that remains.
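One way to see the extra substitutions directly is to count them. The sketch below is only an illustration (the variable names are mine, not from the question); it relies on regsub returning the number of substitutions when a result variable is given as the last argument.

```tcl
set text "foo\nBar\nBaz"

foreach re [list "\n" "^" "^|\n"] {
    # With a variable name as the last argument, regsub stores the result
    # in that variable and returns how many substitutions it made.
    set count [regsub -all $re $text "\nXX" out]

    # Show the newline in the pattern as a literal \n so the label stays on one line.
    set label [string map [list "\n" "\\n"] $re]
    puts "$label: $count substitution(s)"
}
```

If the interpreter behaves as in the output shown in the question, the counts should come out as 2, 1 and 5: the combined pattern matches the start of the string, both newlines, and additionally the start of what remains after each newline.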

Adding something to the start of every line

If your end goal is to add indentation to every line, you can use newline-sensitive matching with regsub and then use ^ to match the start of every line, including the first, rather than trying to match both newlines and the start of the string. You do this by adding the -line option to regsub. For example:

regsub -line -all "^" "foo\nBar\nBaz" "XX" t; puts $t
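The same idea can be wrapped in a small helper. This is a minimal sketch, assuming the goal is to indent with spaces; the proc names indent_lines and indent_lines_map are hypothetical, and the second proc is a regex-free variant included only for comparison.

```tcl
# Hypothetical helper (not from the original answer): indent every line of
# $text by $n spaces using newline-sensitive matching.
proc indent_lines {text {n 2}} {
    set pad [string repeat " " $n]
    # -line lets ^ match after every newline as well as at the start of the string.
    return [regsub -line -all {^} $text $pad]
}

# Regex-free variant for comparison: prefix the first line, then map each
# newline to "newline plus padding".
proc indent_lines_map {text {n 2}} {
    set pad [string repeat " " $n]
    return $pad[string map [list "\n" "\n$pad"] $text]
}

puts [indent_lines "foo\nBar\nBaz" 2]
```

For the sample input from the question both procs give the same three indented lines. They are not claimed to agree on every input, for example text that ends in a trailing newline.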


answered Oct 29 '22 20:10

Bryan Oakley
