Converting indentation with preg_replace (no callback)

Question

I have some XML chunk returned by DOMDocument::saveXML(). It's already pretty indented, with two spaces per level, like so:

<?xml version="1.0"?>
<root>
  <error>
    <a>eee</a>
    <b>sd</b>
  </error>
</root>

As it's not possible (AFAIK) to configure which indentation character(s) DOMDocument uses, I thought I could run a regular expression over the output and replace every pair of leading spaces with a tab. This can be done with a callback function (Demo):

$xml_string = $doc->saveXML();
function callback($m)
{
    $spaces = strlen($m[0]);
    $tabs = $spaces / 2;
    return str_repeat("\t", $tabs);
}
$xml_string = preg_replace_callback('/^(?:[ ]{2})+/um', 'callback', $xml_string);

I'm now wondering if it's possible to do this w/o a callback function (and without the e-modifier (EVAL)). Any regex wizards with an idea?

Qtax · Accepted Answer

You can use \G:

preg_replace('/^  |\G  /m', "\t", $string);
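To see why this works: `^  ` matches the first pair of spaces on a line, and `\G  ` matches a further pair only at the exact position where the previous match ended. The chain therefore stops at the first non-space character. Here is a minimal sketch; the sample string and the variable names are made up for illustration:

```php
<?php
// Illustration only: the sample input below is invented for this demo.
$in = "<root>\n  <a>\n    <b>x  y</b>\n  </a>\n</root>\n";

// '^  ' starts a chain at the beginning of a line; '\G  ' continues it
// only directly after the previous match, so the two spaces between
// "x" and "y" are left alone.
$out = preg_replace('/^  |\G  /m', "\t", $in);

// Show tabs as a visible \t so the result is easy to read.
echo str_replace("\t", '\t', $out);
```

The nested `<b>` line ends up with two tabs, while the spaces inside the element content stay as they were.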

Did some benchmarks and got the following results on Win32 with PHP 5.2 and 5.4:

>php -v
PHP 5.2.17 (cli) (built: Jan  6 2011 17:28:41)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2010 Zend Technologies

>php -n test.php
XML length: 21100
Iterations: 1000
callback: 2.3627231121063
\G:       1.4221360683441
while:    3.0971200466156
/e:       7.8781840801239


>php -v
PHP 5.4.0 (cli) (built: Feb 29 2012 19:06:50)
Copyright (c) 1997-2012 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2012 Zend Technologies

>php -n test.php
XML length: 21100
Iterations: 1000
callback: 1.3771259784698
\G:       1.4414191246033
while:    2.7389969825745
/e:       5.5516891479492

Surprisingly, the callback is faster than \G in PHP 5.4 (although that seems to depend on the data; \G is faster in some other cases).

For \G, /^  |\G  /m is used; it is a bit faster than /(?:^|\G)  /m. /(?>^|\G)  /m is even slower than /(?:^|\G)  /m. The /u, /S and /X switches didn't affect \G performance noticeably.
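The three spellings differ only in speed, not in what they match. A quick sanity check along these lines (sample input and variable names are made up) can confirm that before benchmarking:

```php
<?php
// Illustration only: the sample input is invented for this check.
$in = "<r>\n  <a>\n      <b>x  y</b>\n  </a>\n</r>\n";

$patterns = array(
    '/^  |\G  /m',      // alternation at the top level
    '/(?:^|\G)  /m',    // non-capturing group
    '/(?>^|\G)  /m',    // atomic group
);

$results = array();
foreach ($patterns as $p) {
    $results[] = preg_replace($p, "\t", $in);
}

// All three variants should produce the same string.
var_dump(count(array_unique($results)) === 1);
```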

The while replace is fastest if depth is low (up to about 4 indentations, 8 spaces, in my test), but then gets slower as the depth increases.
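The reason is that each pass of the while loop converts only one pair of spaces per line, so the number of full scans over the string grows with the deepest indentation. A small sketch to make that visible; `count_passes()` is a hypothetical helper and not part of the benchmark script:

```php
<?php
// Illustration only: count_passes() is a hypothetical helper. It counts
// how many preg_replace passes the while approach needs until no line
// starts with tabs followed by two spaces.
function count_passes($s) {
    $re = '/^(\t*)[ ]{2}/m';
    $passes = 0;
    while (preg_match($re, $s)) {
        $s = preg_replace($re, "$1\t", $s);
        ++$passes;
    }
    return $passes;
}

echo count_passes(str_repeat('  ', 2) . "<a/>\n"), "\n";  // indentation depth 2
echo count_passes(str_repeat('  ', 8) . "<a/>\n"), "\n";  // indentation depth 8
```

This prints 2 and 8: one pass per indentation level, and every pass rescans the whole string.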

The following code was used:

<?php

$base_iter = 1000;

$xml_string = str_repeat(<<<_STR_
<?xml version="1.0"?>
<root>
  <error>
    <a>  eee  </a>
    <b>  sd    </b>         
    <c>
            deep
                deeper  still
                    deepest  !
    </c>
  </error>
</root>
_STR_
, 100);


//*** while ***

$re = '%# Match leading spaces following leading tabs.
    ^                     # Anchor to start of line.
    (\t*)                 # $1: Preserve any/all leading tabs.
    [ ]{2}                # Match one indent level (two spaces).
    %mx';

function conv_indent_while($xml_string) {
    global $re;

    while(preg_match($re, $xml_string))
        $xml_string = preg_replace($re, "$1\t", $xml_string);

    return $xml_string;
}


//*** \G ****

function conv_indent_g($string){
    return preg_replace('/^  |\G  /m', "\t", $string);
}


//*** callback ***

function callback($m)
{
    $spaces = strlen($m[0]);
    $tabs = $spaces / 2;
    return str_repeat("\t", $tabs);
}
function conv_indent_callback($str){
    return preg_replace_callback('/^(?:[ ]{2})+/m', 'callback', $str);
}


//*** callback /e *** 

function conv_indent_e($str){
    return preg_replace('/^(?:  )+/me', 'str_repeat("\t", strlen("$0")/2)', $str);
}



//*** tests

function test2() {
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){
        $s = conv_indent_while($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 2");
    }

    return (microtime(true) - $t);
}

function test1() {
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){
        $s = conv_indent_g($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 1");
    }

    return (microtime(true) - $t);
}

function test0(){
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){     
        $s = conv_indent_callback($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 0");
    }

    return (microtime(true) - $t);
}


function test3(){
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){     
        $s = conv_indent_e($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 02");
    }

    return (microtime(true) - $t);
}



echo 'XML length: ' . strlen($xml_string) . "\n";
echo 'Iterations: ' . $base_iter . "\n";

echo 'callback: ' . test0() . "\n";
echo '\G:       ' . test1() . "\n";
echo 'while:    ' . test2() . "\n";
echo '/e:       ' . test3() . "\n";


?>

ridgerunner · Answer

The following simplistic solution first comes to mind:

$xml_string = str_replace('  ', "\t", $xml_string);

But I assume you would like to limit the replacement to leading whitespace only. For that case, your current solution looks pretty clean to me. That said, you can do it without a callback or the e modifier, but you need to apply the replacement repeatedly, in a loop, to get the job done, like so:

$re = '%# Match leading spaces following leading tabs.
    ^                     # Anchor to start of line.
    (\t*)                 # $1: Preserve any/all leading tabs.
    [ ]{2}                # Match one indent level (two spaces).
    %umx';
while(preg_match($re, $xml_string))
    $xml_string = preg_replace($re, "$1\t", $xml_string);

Surprisingly, my testing shows this to be nearly twice as fast as the callback method. (I would have guessed the opposite.)
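As a side note, the separate preg_match() scan in the loop is not strictly needed, because preg_replace() can report how many replacements it made through its fifth parameter. A sketch of that variation follows; the function name is hypothetical, and this version is not the one that was benchmarked:

```php
<?php
// Sketch of a variation (not the code measured in this answer):
// preg_replace() returns the number of replacements via its fifth
// parameter, so the loop can stop as soon as a pass changes nothing.
function tabify_leading_spaces_count($xml_string) {
    $re = '/^(\t*)[ ]{2}/m';
    do {
        // -1 means "no limit"; $count receives the replacements made.
        $xml_string = preg_replace($re, "$1\t", $xml_string, -1, $count);
    } while ($count > 0);
    return $xml_string;
}

$in = "<r>\n  <a>\n    <b>x  y</b>\n  </a>\n</r>\n";
echo str_replace("\t", '\t', tabify_leading_spaces_count($in));
```

It still needs one pass per indentation level plus a final pass that finds nothing, so the scaling with depth is the same.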

Note that Qtax has an elegant solution that works just fine (I gave it my +1). However, my benchmarks show it to be slower than the original callback method. I think this is because the expression /(?:^|\G)  /um does not allow the regex engine to take advantage of the "anchor at the beginning of the pattern" internal optimization. The RE engine is forced to test the pattern against each and every position in the target string. With pattern expressions beginning with the ^ anchor, the RE engine only needs to check at the beginning of each line, which allows it to match much faster.

Excellent question! +1

Addendum/Correction:

I must apologize because the performance statements I made above are wrong. I ran the regexes against only one (non-representative) test file, which had mostly tabs in the leading whitespace. When tested against a more realistic file with lots of leading spaces, my loop-based method above performs significantly slower than the other two methods.

If anyone is interested, here is the benchmark script I used to measure the performance of each regex:

`test.php`

<?php // test.php 20120308_1200
require_once('inc/benchmark.inc.php');

// -------------------------------------------------------
// Test 1: Recursive method. (ridgerunner)
function tabify_leading_spaces_1($xml_string) {
    $re = '%# Match leading spaces following leading tabs.
        ^                     # Anchor to start of line.
        (\t*)                 # $1: Any/all leading tabs.
        [ ]{2}                # Match one indent level (two spaces).
        %umx';
    while(preg_match($re, $xml_string))
        $xml_string = preg_replace($re, "$1\t", $xml_string);
    return $xml_string;
}

// -------------------------------------------------------
// Test 2: Original callback method. (hakre)
function tabify_leading_spaces_2($xml_string) {
    return preg_replace_callback('/^(?:[ ]{2})+/um', '_callback', $xml_string);
}
function _callback($m) {
    $spaces = strlen($m[0]);
    $tabs = $spaces / 2;
    return str_repeat("\t", $tabs);
}

// -------------------------------------------------------
// Test 3: Qtax's elegantly simple \G method. (Qtax)
function tabify_leading_spaces_3($xml_string) {
    return preg_replace('/(?:^|\G)  /um', "\t", $xml_string);
}

// -------------------------------------------------------
// Verify we get the same results from all methods.
$data = file_get_contents('testdata.txt');
$data1 = tabify_leading_spaces_1($data);
$data2 = tabify_leading_spaces_2($data);
$data3 = tabify_leading_spaces_3($data);
if ($data1 == $data2 && $data2 == $data3) {
    echo ("GOOD: Same results.\n");
} else {
    exit("BAD: Different results.\n");
}
// Measure and print the function execution times.
$time1 = benchmark_12('tabify_leading_spaces_1', $data, 2, true);
$time2 = benchmark_12('tabify_leading_spaces_2', $data, 2, true);
$time3 = benchmark_12('tabify_leading_spaces_3', $data, 2, true);
?>

The above script uses the following handy little benchmarking function I wrote some time ago:

`benchmark.inc.php`

<?php // benchmark.inc.php
/*----------------------------------------------------------------------------
 function benchmark_12($funcname, $p1, $reptime = 1.0, $verbose = false, $p2 = NULL) {}
    By: Jeff Roberson
    Created:        2010-03-17
    Last edited:    2012-03-08

Discussion:
    This function measures the time required to execute a given function by
    calling it as many times as possible within an allowed period == $reptime.
    A first pass determines a rough measurement of function execution time
    by increasing the $nreps count by a factor of 10 - (i.e. 1, 10, 100, ...),
    until an $nreps value is found which takes more than 0.01 secs to finish.
    A second pass uses the value determined in the first pass to compute the
    number of reps that can be performed within the allotted $reptime seconds.
    The second pass then measures the time required to call the function the
    computed number of times (which should take about $reptime seconds). The
    average function execution time is then computed by dividing the total
    measured elapsed time by the number of reps performed in that time, and
    then all the pertinent values are returned to the caller in an array.

    Note that this function is limited to measuring only those functions
    having either one or two arguments that are passed by value and
    not by reference. This is why the name of this function ends with "12".
    Variations of this function can be easily cloned which can have more
    than two parameters.

Parameters:
    $funcname:  String containing name of function to be measured. The
                function to be measured must take one or two parameters.
    $p1:        First argument to be passed to $funcname function.
    $reptime    Target number of seconds allowed for benchmark test.
                (float) (Default=1.0)
    $verbose    Boolean value determines if results are printed.
                (bool) (Default=false)
    $p2:        Second (optional) argument to be passed to $funcname function.
Return value:
    $result[]   Array containing measured and computed values:
    $result['funcname']     : $funcname - Name of function measured.
    $result['msg']          : $msg - String with formatted results.
    $result['nreps']        : $nreps - Number of function calls made.
    $result['time_total']   : $time - Seconds to call function $nreps times.
    $result['time_func']    : $t_func - Seconds to call function once.
    $result['result']       : $result - Last value returned by function.

Variables:
    $time:      Float epoch time (secs since 1/1/1970) or benchmark elapsed secs.
    $i:         Integer loop counter.
    $nreps      Number of times function called in benchmark measurement loops.

----------------------------------------------------------------------------*/
function benchmark_12($funcname, $p1, $reptime = 1.0, $verbose = false, $p2 = NULL) {
    if (!function_exists($funcname)) {
        exit("\n[benchmark1] Error: function \"{$funcname}()\" does not exist.\n");
    }
    if (!isset($p2)) { // Case 1: function takes one parameter ($p1).
    // Pass 1: Measure order of magnitude number of calls needed to exceed 10 milliseconds.
        for ($time = 0.0, $n = 1; $time < 0.01; $n *= 10) { // Exponentially increase $nreps.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $n; ++$i) {       // Loop $n times. ($n = 1, 10, 100...)
                $result = ($funcname($p1));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $nreps = $n;                        // Number of reps just measured.
        }
        $t_func = $time / $nreps;               // Function execution time in sec (rough).
    // Pass 2: Measure time required to perform $nreps function calls (in about $reptime sec).
        if ($t_func < $reptime) {               // If pass 1 time was not pathetically slow...
            $nreps = (int)($reptime / $t_func); // Figure $nreps calls to add up to $reptime.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $nreps; ++$i) {   // Loop $nreps times (should take $reptime).
                $result = ($funcname($p1));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $t_func = $time / $nreps;           // Average function execution time in sec.
        }
    } else { // Case 2: function takes two parameters ($p1 and $p2).
    // Pass 1: Measure order of magnitude number of calls needed to exceed 10 milliseconds.
        for ($time = 0.0, $n = 1; $time < 0.01; $n *= 10) { // Exponentially increase $nreps.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $n; ++$i) {       // Loop $n times. ($n = 1, 10, 100...)
                $result = ($funcname($p1, $p2));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $nreps = $n;                        // Number of reps just measured.
        }
        $t_func = $time / $nreps;               // Function execution time in sec (rough).
    // Pass 2: Measure time required to perform $nreps function calls (in about $reptime sec).
        if ($t_func < $reptime) {               // If pass 1 time was not pathetically slow...
            $nreps = (int)($reptime / $t_func); // Figure $nreps calls to add up to $reptime.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $nreps; ++$i) {   // Loop $nreps times (should take $reptime).
                $result = ($funcname($p1, $p2));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $t_func = $time / $nreps;           // Average function execution time in sec.
        }
    }
    $msg = sprintf("%s() Nreps:%7d  Time:%7.3f s  Function time: %.6f sec\n",
            $funcname, $nreps, $time, $t_func);
    if ($verbose) echo($msg);
    return array('funcname' => $funcname, 'msg' => $msg, 'nreps' => $nreps,
        'time_total' => $time, 'time_func' => $t_func, 'result' => $result);
}
?>
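The docblock mentions that variants for more parameters can be cloned. One way to avoid cloning is to pass the arguments as an array. The following is only a sketch of that idea: `benchmark_n()` is a hypothetical name, and `call_user_func_array()` adds a little call overhead that the original function does not have.

```php
<?php
// Sketch only: benchmark_n() is a hypothetical name. The two-pass timing
// idea is the same as in benchmark_12(), but the arguments are passed as
// an array so any number of by-value parameters is supported.
function benchmark_n($funcname, array $args, $reptime = 1.0) {
    // Pass 1: grow the rep count by 10x until one run takes over 10 ms.
    $n = 1;
    do {
        $nreps = $n;
        $time = microtime(true);
        for ($i = 0; $i < $nreps; ++$i) {
            $result = call_user_func_array($funcname, $args);
        }
        $time = microtime(true) - $time;
        $n *= 10;
    } while ($time < 0.01);
    $t_func = $time / $nreps;               // Rough time per call.

    // Pass 2: run as many reps as should fit into $reptime seconds.
    if ($t_func < $reptime) {
        $nreps = max(1, (int)($reptime / $t_func));
        $time = microtime(true);
        for ($i = 0; $i < $nreps; ++$i) {
            $result = call_user_func_array($funcname, $args);
        }
        $time = microtime(true) - $time;
        $t_func = $time / $nreps;           // Average time per call.
    }
    return array('nreps' => $nreps, 'time_total' => $time,
                 'time_func' => $t_func, 'result' => $result);
}

// Usage: time str_repeat('ab', 3) for roughly 0.05 seconds.
$r = benchmark_n('str_repeat', array('ab', 3), 0.05);
printf("%d reps, %.6f s per call\n", $r['nreps'], $r['time_func']);
```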

When I run test.php using the contents of benchmark.inc.php, here are the results I get:

GOOD: Same results.
tabify_leading_spaces_1() Nreps: 1756 Time: 2.041 s Function time: 0.001162 sec
tabify_leading_spaces_2() Nreps: 1738 Time: 1.886 s Function time: 0.001085 sec
tabify_leading_spaces_3() Nreps: 2161 Time: 2.044 s Function time: 0.000946 sec

Bottom line: I would recommend using Qtax's method.

Thanks Qtax!
