I have a variable that looks like this:
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
and I want to extract the data-tpl-attributes in a way so I end up with a resulting array that looks like this:
$array = array(
    'classname' => 'class',
    'title' => 'innerHTML'
);
The number of "data-tpl-" attributes varies, and it's not always an <li> element. Other than that, it always follows the same format: data-tpl-attributename="attributePlacement".
How can I retrieve those attributes and store them in an array without using regex? I say without regex because everywhere I look, parsing HTML with regex seems to be considered an evil practice. Or is it OK in this case?
You can use the DOMDocument class for this, and yes, don't use regular expressions. This is just a starting point that you can build on.
<?php
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
echo "<pre>";
function parseTag($content, $tg)
{
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $attr = array();
    foreach ($dom->getElementsByTagName($tg) as $tag) {
        foreach ($tag->attributes as $attribName => $attribNodeVal) {
            $attr[$attribName] = $tag->getAttribute($attribName);
        }
    }
    return $attr;
}
$attrib_arr = parseTag($var, 'li');
print_r($attrib_arr);
OUTPUT :
Array
(
[data-tpl-classname] => class
[data-tpl-title] => innerHTML
)
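The question asks for keys without the data-tpl- prefix and for any element, not only <li>. A minimal sketch of how the DOMDocument approach could be adapted is below. The function name extractTplAttributes and its $prefix parameter are made up for illustration, and it assumes the fragment is small enough that walking every element is acceptable.

```php
<?php
// Hypothetical helper (not part of the answer above): collect every
// attribute whose name starts with $prefix, from any element, and
// strip the prefix from the resulting array keys.
function extractTplAttributes($html, $prefix = 'data-tpl-')
{
    // Collect libxml warnings internally instead of printing them;
    // HTML fragments can trigger them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = array();
    // '*' matches every element, so the tag name does not matter.
    foreach ($dom->getElementsByTagName('*') as $element) {
        foreach ($element->attributes as $attr) {
            if (strpos($attr->nodeName, $prefix) === 0) {
                $key = substr($attr->nodeName, strlen($prefix));
                $result[$key] = $attr->nodeValue;
            }
        }
    }
    return $result;
}

print_r(extractTplAttributes('<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>'));
```

If the same attribute name appears on several elements, the last one wins in this sketch; whether that is acceptable depends on your templates.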
You can extract the values by using some string functions. It looks like this:
$test1 = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
$test2 = '<div data-tpl-anything="something" data-tpl-title="this is a title" data-tpl-third="asdasd"></div>';
var_dump(extract_tpl($test1));
var_dump(extract_tpl($test2));
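function extract_tpl($string, $prefix = "data-tpl-") {
    $array = array(); // always return an array, even when nothing matches
    $start = 0;
    $end = 0;
    // compare with false explicitly: strpos() can legitimately return 0
    while (strpos($string, $prefix, $end) !== false)
    {
        $start = strpos($string, $prefix, $start) + strlen($prefix);
        $end = strpos($string, '"', $start) - 1;
        $end2 = strpos($string, '"', $end + 2);
        $array[substr($string, $start, $end - $start)] = substr($string, $end + 2, $end2 - $end - 2);
    }
    return $array;
}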
Output:
array (size=2)
'classname' => string 'class' (length=5)
'title' => string 'innerHTML' (length=9)
array (size=3)
'anything' => string 'something' (length=9)
'title' => string 'this is a title' (length=15)
'third' => string 'asdasd' (length=6)
The numbers in the code (-1, +2, ...) are for skipping symbols such as = and " .
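To make those offsets concrete, here is a short trace of the first loop iteration on the first test string. It only restates the arithmetic from the function above; the variable names $name and $value are added for illustration.

```php
<?php
$string = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
$prefix = 'data-tpl-';

// The prefix starts at offset 4 and is 9 characters long,
// so $start (13) points at the "c" of "classname".
$start = strpos($string, $prefix) + strlen($prefix);

// The opening quote is at offset 23; -1 steps back onto the "=" (22).
$end = strpos($string, '"', $start) - 1;

// +2 skips the "=" and the opening quote, so the search begins inside
// the value and finds the closing quote at offset 29.
$end2 = strpos($string, '"', $end + 2);

$name  = substr($string, $start, $end - $start);      // 9 characters from offset 13
$value = substr($string, $end + 2, $end2 - $end - 2); // 5 characters from offset 24

var_dump($name, $value);
```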
It's evil, but not completely. Of course, a regex may be slow on big strings or with a really complex pattern, but neither applies in your case. It is also arguably more readable, and easier and quicker to implement, than an HTML or XML parser, which is no better optimized than a simple regex match.
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
// PREG_SET_ORDER gives one entry per attribute: [full match, name, value]
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $var, $matches, PREG_SET_ORDER);
$array = array();
foreach ($matches as $match) {
    $array[$match[1]] = $match[2];
}
I used [^"]* instead of .*? since it is a bit quicker.
Note: I just ran a benchmark. Compared to the first answer using DOMDocument, this regex code is 4 times faster, but less clean, since parsing the DOM with a regex may lead to misinterpretations of the markup. It is slightly slower than the answer using str functions (but easier to read and to maintain).
Note 2: Of course, use this solution only if you are sure of the input format and there will never be any ambiguity; otherwise, the solution with DOMDocument is cleaner.
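As a rough illustration of what "sure of the input format" means, the sketch below feeds two made-up inputs to a regex and to DOMDocument. The inputs and the helper names tplByRegex and tplByDom are assumptions for this example only. One input uses single-quoted values, the other mentions data-tpl- inside plain text rather than inside a tag.

```php
<?php
// Regex approach: assumes double-quoted values and does not know
// whether the match sits inside a tag or inside text.
function tplByRegex($html)
{
    preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $html, $matches, PREG_SET_ORDER);
    $out = array();
    foreach ($matches as $match) {
        $out[$match[1]] = $match[2];
    }
    return $out;
}

// Parser approach: only real attributes of real elements are reported.
function tplByDom($html, $prefix = 'data-tpl-')
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = array();
    foreach ($dom->getElementsByTagName('*') as $element) {
        foreach ($element->attributes as $attr) {
            if (strpos($attr->nodeName, $prefix) === 0) {
                $out[substr($attr->nodeName, strlen($prefix))] = $attr->nodeValue;
            }
        }
    }
    return $out;
}

// Valid HTML, but with single quotes: the regex finds nothing,
// the parser finds title => innerHTML.
$singleQuoted = "<li data-tpl-title='innerHTML'></li>";

// The text merely talks about an attribute: the regex reports
// foo => bar, the parser reports nothing.
$insideText = '<p>Write data-tpl-foo="bar" in your markup</p>';

var_dump(tplByRegex($singleQuoted), tplByDom($singleQuoted));
var_dump(tplByRegex($insideText), tplByDom($insideText));
```

If your templates never contain such cases, the regex stays a reasonable shortcut.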
Why regular expressions should be used wisely or avoided when parsing HTML:
http://blog.codinghorror.com/parsing-html-the-cthulhu-way
Use them with that in mind:
- It's generally a bad idea.
- Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
- I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.