Extract html attributes from string in PHP [duplicate]

Question

I have a variable that looks like this:

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';

and I want to extract the data-tpl-attributes in a way so I end up with a resulting array that looks like this:

$array = array(
    'classname' => 'class',
    'title' => 'innerHTML'
);

The number of "data-tpl-" attributes varies, and it's not always an <li> element. Other than that, it always follows the same format: data-tpl-attributename="attributePlacement".

How can I retrieve those attributes and store them in an array without using regex? I say without regex because everywhere I look, parsing HTML with regex is described as an evil practice. Or is it OK in this case?

Shankar Narayana Damodaran · Accepted Answer

You can use the DOMDocument class for this, and yes, avoid regular expressions. This is just a starting point that you can build on.

<?php
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
echo "<pre>";

// Returns every attribute of the $tg elements found in $content,
// keyed by the full attribute name (the data-tpl- prefix is kept).
function parseTag($content, $tg)
{
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $attr = array();
    foreach ($dom->getElementsByTagName($tg) as $tag) {
        foreach ($tag->attributes as $attribName => $attribNodeVal) {
            $attr[$attribName] = $tag->getAttribute($attribName);
        }
    }
    return $attr;
}

$attrib_arr = parseTag($var, 'li');
print_r($attrib_arr);

Output:

Array
(
    [data-tpl-classname] => class
    [data-tpl-title] => innerHTML
)

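The question asks for keys without the data-tpl- prefix and says the element is not always an <li>. A minimal sketch of how the DOMDocument approach could cover both, assuming a hypothetical helper named extractPrefixedAttributes (not part of the original answer) that walks every element and keeps only the prefixed attributes:

```php
<?php
// Hypothetical helper: collects attributes starting with $prefix from every
// element in an HTML fragment, keyed by the attribute name without the prefix.
function extractPrefixedAttributes(string $html, string $prefix = 'data-tpl-'): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = array();
    // '*' matches every element, so the tag name does not need to be known.
    foreach ($dom->getElementsByTagName('*') as $element) {
        foreach ($element->attributes as $attribute) {
            if (strpos($attribute->nodeName, $prefix) === 0) {
                $key = substr($attribute->nodeName, strlen($prefix));
                $result[$key] = $attribute->nodeValue;
            }
        }
    }
    return $result;
}

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
print_r(extractPrefixedAttributes($var));
```

If several elements in the fragment carry the same data-tpl- attribute, the last one wins in this sketch.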

Balázs Varga · Answer

You can extract the values by using some string functions. It looks like this:

$test1 = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
$test2 = '<div data-tpl-anything="something" data-tpl-title="this is a title" data-tpl-third="asdasd"></div>';

var_dump(extract_tpl($test1));
var_dump(extract_tpl($test2));

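function extract_tpl($string,$prefix="data-tpl-") {
    $array = array(); // returned as-is when no attribute matches
    $start = 0;
    $end = 0;

    // Compare with false explicitly: strpos() returns 0 for a match at the very start
    while(strpos($string,$prefix,$end) !== false)
    {
        $start = strpos($string,$prefix,$start)+strlen($prefix); // first character of the name
        $end = strpos($string,'"',$start)-1;                     // the = before the opening quote
        $end2 = strpos($string,'"',$end+2);                      // the closing quote
        $array[substr($string,$start,$end-$start)] = substr($string,$end+2,$end2-$end-2);
    }

    return $array;
}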

Output:

array (size=2)
  'classname' => string 'class' (length=5)
  'title' => string 'innerHTML' (length=9)

array (size=3)
  'anything' => string 'something' (length=9)
  'title' => string 'this is a title' (length=15)
  'third' => string 'asdasd' (length=6)

The numbers in the code (-1, +2, ...) are offsets that skip over symbols such as = and ".
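To make those offsets concrete, here is a small walk-through for a single attribute. The input string is made up for illustration; the variable names follow the function above.

```php
<?php
$string = '<li data-tpl-title="innerHTML"></li>';
$prefix = 'data-tpl-';

// The prefix starts at index 4 and is 9 characters long.
$start = strpos($string, $prefix) + strlen($prefix); // 13: the "t" of title
$end   = strpos($string, '"', $start) - 1;           // 18: the = before the opening quote
$end2  = strpos($string, '"', $end + 2);             // 29: the closing quote

// $end + 2 skips the = and the opening quote, landing on the first value character.
$name  = substr($string, $start, $end - $start);      // "title"
$value = substr($string, $end + 2, $end2 - $end - 2); // "innerHTML"

var_dump($name, $value);
```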

Tronix117 · Answer

Regex is not entirely evil here. It can be slow on large strings or with really complex patterns, but neither applies in your case. It is also arguably more readable, and easier and quicker to implement, than an HTML or XML parser, which is no more optimized than a simple regexp match.

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
// PREG_SET_ORDER groups the captures per match: [full match, name, value]
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $var, $matches, PREG_SET_ORDER);

$array = array();
foreach ($matches as $match) {
  $array[$match[1]] = $match[2];
}

I used [^"]* instead of .*? since it is a bit quicker.
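As a quick check that the two patterns are interchangeable on this kind of input, the sketch below runs both and compares the captures. It only shows that they match the same things here; it does not measure speed.

```php
<?php
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';

// Negated character class: consume everything that is not a double quote.
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $var, $negated, PREG_SET_ORDER);
// Lazy quantifier: consume as little as possible before the next literal.
preg_match_all('/data-tpl-(.*?)="(.*?)"/i', $var, $lazy, PREG_SET_ORDER);

var_dump($negated === $lazy); // bool(true) for this input
```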

Note: I just ran a benchmark. Compared to the first answer using DOMDocument, this regexp code is 4 times faster, but it is less clean, since parsing the DOM with a regexp may lead to misinterpretations of the markup. It is slightly slower than the answer using string functions (but easier to read and to maintain).
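For anyone who wants to repeat a comparison like this, a rough timing sketch follows. The timeIt helper and the iteration count are assumptions for illustration, not the benchmark that produced the figures above, and the numbers will vary with PHP version and machine.

```php
<?php
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';

// Hypothetical timing helper: runs $fn $iterations times, returns elapsed seconds.
function timeIt(callable $fn, int $iterations = 1000): float
{
    $begin = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $begin;
}

$viaDom = function () use ($var) {
    $dom = new DOMDocument();
    $dom->loadHTML($var);
    $attr = array();
    foreach ($dom->getElementsByTagName('li') as $tag) {
        foreach ($tag->attributes as $node) {
            $attr[$node->nodeName] = $node->nodeValue;
        }
    }
    return $attr;
};

$viaRegex = function () use ($var) {
    // The prefix is captured too, so both closures return the same keys.
    preg_match_all('/(data-tpl-[^"]*)="([^"]*)"/i', $var, $m);
    return array_combine($m[1], $m[2]);
};

printf("DOMDocument:    %.4fs\n", timeIt($viaDom));
printf("preg_match_all: %.4fs\n", timeIt($viaRegex));
```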

Note 2: Of course, use this solution only if you are sure of the input format and there can never be any confusion; otherwise the DOMDocument solution is cleaner.
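One way such confusion can arise, sketched with a made-up input: text inside an element that merely looks like an attribute. The regexp picks it up, while DOMDocument only reports real attributes.

```php
<?php
// The second data-tpl-... occurrence is plain text inside the element, not an attribute.
$html = '<p data-tpl-a="1">data-tpl-b="2"</p>';

preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $html, $m);
$fromRegex = array_combine($m[1], $m[2]);

$dom = new DOMDocument();
$dom->loadHTML($html);
$fromDom = array();
foreach ($dom->getElementsByTagName('p') as $p) {
    foreach ($p->attributes as $attribute) {
        $fromDom[$attribute->nodeName] = $attribute->nodeValue;
    }
}

print_r($fromRegex); // contains both "a" and "b"
print_r($fromDom);   // contains only "data-tpl-a"
```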

Why regular expressions should be used wisely, or avoided, when parsing HTML:

http://blog.codinghorror.com/parsing-html-the-cthulhu-way

Use them with that in mind:

It's generally a bad idea.

Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.

I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.

Tags: arrays, dom, function, php, parsing

Weblurk
