Extract html attributes from string in PHP [duplicate]

I have a variable that looks like this:

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';

and I want to extract the data-tpl- attributes so that I end up with an array that looks like this:

$array = array(
    'classname' => 'class',
    'title' => 'innerHTML'
);

The number of "data-tpl-" attributes varies, and it's not always an <li> element. Other than that, it always follows the same format: data-tpl-attributename="attributePlacement".

How can I retrieve those attributes and store them in an array without using regex? I say without regex because everywhere I look, parsing HTML with regex is described as an evil practice. Or is it OK in this case?

Weblurk asked Apr 24 '14 07:04



3 Answers

You can use the DOMDocument class for this, and yes, don't use regular expressions. This is just a starting point that you can build on.

<?php
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
echo "<pre>";

// Returns all attributes of every <$tg> element in $content as name => value.
function parseTag($content, $tg)
{
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $attr = array();
    foreach ($dom->getElementsByTagName($tg) as $tag) {
        // Iterating the attribute map yields attribute name => attribute node.
        foreach ($tag->attributes as $attribName => $attribNodeVal) {
            $attr[$attribName] = $tag->getAttribute($attribName);
        }
    }
    return $attr;
}

$attrib_arr = parseTag($var, 'li');
print_r($attrib_arr);

Output:

Array
(
    [data-tpl-classname] => class
    [data-tpl-title] => innerHTML
)
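
The output above keeps the full attribute names and only looks at one tag name. Below is a minimal sketch of how the same DOMDocument approach could return the array asked for in the question, for any element. It assumes the prefix is always data-tpl-; the function name extractTplAttributes is made up for this illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper, not part of the original answer.
function extractTplAttributes($html, $prefix = 'data-tpl-')
{
    $result = array();

    $dom = new DOMDocument;
    // Collect parser warnings quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // '*' matches every element, so the tag name does not have to be known.
    foreach ($dom->getElementsByTagName('*') as $element) {
        foreach ($element->attributes as $name => $attribute) {
            if (strpos($name, $prefix) === 0) {
                // Drop the prefix: data-tpl-title becomes title.
                $result[substr($name, strlen($prefix))] = $attribute->value;
            }
        }
    }

    return $result;
}

print_r(extractTplAttributes('<div data-tpl-classname="class" data-tpl-title="innerHTML" id="x"></div>'));
```

If several elements in the input carry the same data-tpl- attribute, the last one wins in this sketch, because the attribute name is used as the array key.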


Shankar Narayana Damodaran answered Nov 04 '22 06:11



You can extract the values by using some string functions. It looks like this:

$test1 = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
$test2 = '<div data-tpl-anything="something" data-tpl-title="this is a title" data-tpl-third="asdasd"></div>';

var_dump(extract_tpl($test1));
var_dump(extract_tpl($test2));

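function extract_tpl($string, $prefix = "data-tpl-") {
    $array = array();  // initialised so an input without matches returns an empty array
    $offset = 0;       // position where the next search starts

    // strpos() returns false when nothing is found; compare strictly, because 0 is a valid position
    while (($pos = strpos($string, $prefix, $offset)) !== false)
    {
        $start = $pos + strlen($prefix);          // first character of the attribute name
        $end = strpos($string, '"', $start) - 1;  // the "=" between name and value
        $end2 = strpos($string, '"', $end + 2);   // closing quote of the value
        if ($end2 === false) {
            break;                                // malformed input: stop instead of looping forever
        }
        $array[substr($string, $start, $end - $start)] = substr($string, $end + 2, $end2 - $end - 2);
        $offset = $end2 + 1;                      // continue after the closing quote
    }

    return $array;
}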

Output:

array (size=2)
  'classname' => string 'class' (length=5)
  'title' => string 'innerHTML' (length=9)

array (size=3)
  'anything' => string 'something' (length=9)
  'title' => string 'this is a title' (length=15)
  'third' => string 'asdasd' (length=6)

The offsets in the code (-1, +2, ...) are there to skip symbols such as = and ".

Balázs Varga answered Nov 04 '22 07:11



It's evil, but not completely. A regex may be slow on big strings or with really complex patterns, which is not your case. And it is still easier and quicker to implement (and arguably more readable) than an HTML or XML parser, which is no better optimized than a simple regex match.

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
// PREG_SET_ORDER groups the results per match: [0] full match, [1] name, [2] value
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $var, $matches, PREG_SET_ORDER);

$array = array();
foreach ($matches as $match) {
  $array[$match[1]] = $match[2];
}

I used [^"]* instead of .*? since it is a bit quicker.
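
To check that the two patterns capture the same pairs on the sample input, a quick comparison can help. This is only a sketch: the patterns are written with / delimiters, the pairs are read from capture groups 1 and 2, and the helper name matchTpl is invented here. It says nothing about speed.

```php
<?php
// Hypothetical helper: runs a pattern and returns name => value pairs.
function matchTpl($pattern, $html)
{
    // Default PREG_PATTERN_ORDER: $m[1] holds all names, $m[2] all values.
    preg_match_all($pattern, $html, $m);
    return $m[1] ? array_combine($m[1], $m[2]) : array();
}

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';

$negated = matchTpl('/data-tpl-([^"]*)="([^"]*)"/i', $var);
$lazy    = matchTpl('/data-tpl-(.*?)="(.*?)"/i', $var);

// Compare the pairs found by the negated character class and by the lazy quantifier.
var_dump($negated === $lazy);
```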


Note: I just ran a benchmark. Compared to the first answer using DOMDocument, this regex code is 4 times faster, but less clean, since parsing the DOM with a regex may lead to misinterpretations of the markup. It is slightly slower than the answer using string functions (but easier to read and to maintain).

Note 2: Of course, use this solution only if you are sure of the input format and there will never be any ambiguity; otherwise the DOMDocument solution is cleaner.
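
As an illustration of the kind of confusion meant here, the sketch below runs a regex and DOMDocument over an input that does not follow the expected format. The input string is made up for this example: a double-quoted pair sits inside another attribute's value, and the real data-tpl- attribute uses single quotes.

```php
<?php
// Made-up input that breaks the "always double-quoted, never nested" assumption.
$html = '<p title=\'data-tpl-fake="oops"\' data-tpl-real=\'yes\'>text</p>';

// Regex approach, with PREG_SET_ORDER so each match is [full, name, value].
$byRegex = array();
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $html, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $byRegex[$match[1]] = $match[2];
}

// DOMDocument approach: only real attributes of the element are visited.
$byDom = array();
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('p') as $p) {
    foreach ($p->attributes as $name => $attribute) {
        if (strpos($name, 'data-tpl-') === 0) {
            $byDom[substr($name, strlen('data-tpl-'))] = $attribute->value;
        }
    }
}

print_r($byRegex); // picks up the pair inside title and misses the single-quoted attribute
print_r($byDom);   // contains only the real data-tpl-real attribute
```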


Why regular expressions should be used wisely or avoided when parsing HTML:

http://blog.codinghorror.com/parsing-html-the-cthulhu-way

Use them with that in mind:

  • It's generally a bad idea.
  • Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
  • I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.
Tronix117 answered Nov 04 '22 07:11
