If I have HTML lines like the following (\t means a tab character):
<P>\tSome text</P>
<P>\t\tSome text</P>
<P>\tSome text</P>
Using regex, how can I convert the above to:
<P><BLOCKQUOTE>Some text</BLOCKQUOTE></P>
<P><BLOCKQUOTE><BLOCKQUOTE>Some text</BLOCKQUOTE></BLOCKQUOTE></P>
<P><BLOCKQUOTE>Some text</BLOCKQUOTE></P>
At the moment I have:
for $line (@lines)
{
$line =~ s{^(<P>(?:<BLOCKQUOTE>)*)\t(.+?)((?:</BLOCKQUOTE>)*</P>)$}{$1<BLOCKQUOTE>$2</BLOCKQUOTE>$3}g;
}
The tricky bit here is to somehow enter as many replacement tags as there are tabs.
I'd go with multiple passes, first counting the tabs and then going over the string again to replace them with the right number of open-close replacement tags (BLOCKQUOTE). In this case a single regex is bound to be much more complex and thus that much harder to tweak and maintain.
use warnings;
use strict;
use feature 'say';
my @test_strings = (
qq(<p>\t\ttwo tabs</p>),
qq(<p>\tone tab</p>),
qq(<p>no tab</p>),
qq(<div>\tnot paragraph</div>),
);
say for @test_strings; say '';
for (@test_strings)
{
if (my ($tabs) = /<p>(\t+)/) # capture consecutive tabs
{
my $nt = () = $tabs =~ /\t/g; # count them
my $ot = "<BLOCKQUOTE>" x $nt; # open-tag
my $ct = "</BLOCKQUOTE>" x $nt; # close-tag
s{<p> \t+ ([^\t].*?) </p>}{<p>$ot$1$ct</p>}x;
say;
}
}
Prints
<p> two tabs</p>
<p> one tab</p>
<p>no tab</p>
<div> not paragraph</div>
<p><BLOCKQUOTE><BLOCKQUOTE>two tabs</BLOCKQUOTE></BLOCKQUOTE></p>
<p><BLOCKQUOTE>one tab</BLOCKQUOTE></p>
<p>no tab</p>
<div> not paragraph</div>
Notes
As it stands, this handles at most one paragraph (<p>...</p>) per string. Using
while (my ($tabs) = /<p>(\t+)/g) { ... }
instead of if (...) appears to work with multiple paragraphs, but that needs more testing.
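For strings with several paragraphs, one possible alternative to the while loop is a single global substitution that computes the tags in the replacement. This is a minimal sketch, not the code from the answer above: it swaps the count-then-substitute steps for the /e modifier, and the helper name tabs_to_blockquotes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the original answer):
# for every <p> that starts with tabs, wrap the text in one BLOCKQUOTE
# pair per leading tab. /e evaluates the replacement as Perl code,
# /g applies it to each matching paragraph in the string.
sub tabs_to_blockquotes {
    my ($html) = @_;
    $html =~ s{<p>(\t+)([^\t].*?)</p>}{
        my $n = length $1;
        '<p>' . ('<BLOCKQUOTE>' x $n) . $2 . ('</BLOCKQUOTE>' x $n) . '</p>'
    }ge;
    return $html;
}

my $in = "<p>\tone</p><p>\t\ttwo</p><p>none</p>";
print tabs_to_blockquotes($in), "\n";
```

A paragraph that starts with tabs but does not fit the pattern, for example one that contains only tabs, is simply left unchanged here.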
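The counting uses the =()= "operator". It imposes list context on its right-hand side, so the regex returns the list of all matches. That list assignment is in turn evaluated in scalar context by the assignment on its left, where it yields the number of elements, and so we get the count.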
In this case, with $tabs consisting of only the tab characters, one can simply do
my $nt = split '', $tabs;
(Update: really just my $nt = length $tabs;, as in other answers)
I still use the regex since it also works for a string that contains things other than just tabs.
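The counting options above can be compared side by side. This is a small sketch with made-up input strings; it assumes Perl 5.12 or later, where split in scalar context returns the number of fields without touching @_.

```perl
use strict;
use warnings;

# A string of only tabs, like the one captured by (\t+)
my $tabs = "\t\t\t";

my $by_regex  = () = $tabs =~ /\t/g;   # count of /\t/ matches
my $by_split  = split //, $tabs;       # scalar context: number of fields
my $by_length = length $tabs;          # number of characters

print "$by_regex $by_split $by_length\n";   # 3 3 3

# With other characters mixed in, only the regex still counts tabs
my $mixed        = "\ta\tb";
my $mixed_regex  = () = $mixed =~ /\t/g;
my $mixed_length = length $mixed;

print "$mixed_regex $mixed_length\n";       # 2 4
```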
The code replaces only the consecutive tabs in the beginning, right after <p>, not any that may come later in the string (how I see the requirement).
This is ensured by following the tabs in the pattern (\t+) with a single non-tab character and then any characters, [^\t].*?. The pattern therefore still matches a string with more tabs further down, but replaces only the leading "block" of tabs.
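To illustrate the leading-tabs-only behavior, here is a short sketch that applies the same two steps to a made-up string with a tab in the middle of the text. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Two leading tabs, plus one tab later in the text
my $s = "<p>\t\tfirst\tsecond</p>";

if (my ($tabs) = $s =~ /<p>(\t+)/) {
    my $n  = length $tabs;
    my $ot = "<BLOCKQUOTE>"  x $n;
    my $ct = "</BLOCKQUOTE>" x $n;
    # Only the tabs right after <p> are consumed by \t+
    $s =~ s{<p> \t+ ([^\t].*?) </p>}{<p>$ot$1$ct</p>}x;
}

# Make the remaining tab visible when printing
(my $shown = $s) =~ s/\t/\\t/g;
print "$shown\n";
# <p><BLOCKQUOTE><BLOCKQUOTE>first\tsecond</BLOCKQUOTE></BLOCKQUOTE></p>
```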