Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get ID using a specific word in regex?

Tags:

regex

php

My string:

<div class="sect1" id="s9781473910270.i101">       
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p> 
</div>
</div>           
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>

<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>        

Objective: 1.In the string above label*string and ref*string are the cross references. In the place of [ref*string] I need to replace with a with the atributes of class and href, href is the id of div where related label* resides. And class of a is the class of div

  1. As I mentioned above a element class and ID is their relative div class names and ID. But if div class="metadata" exists, need to ignore it should not take their class name and ID.

Expected output:

<div class="sect1" id="s9781473910270.i101">       
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p> 
</div>
</div>             
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>


<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>          
</div>      

How to do it in simpler way without using DOM parser?

My idea is, have to store label* string and their ID in an array and will loop against ref string to match the label* string if string matches then their related id and class should be replaced in the place of ref* string , So I have tried this regex to get label*string and their related id and class name.

like image 441
Learning Avatar asked Jun 05 '15 09:06

Learning


People also ask

How do you search for a word in regex?

To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹ \bcat\b ›. The first ‹ \b › requires the ‹ c › to occur at the very start of the string, or after a nonword character.

What does ?= * Mean in regex?

?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured. Your example means the match needs to be followed by zero or more characters and then a digit (but again that part isn't captured).

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1. 1* means any number of ones.


1 Answers

This approach consists to use the html structure to retrieve needed elements with DOMXPath. Regex are used in a second time to extract informations from text nodes or attributes:

$classRel = ['sect2'  => 'section-ref',
             'figure' => 'fig-ref'];

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url); 

$xp = new DOMXPath($dom);

// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");

function hasClass($classNode, $className) {
    if (!empty($classNode))
        return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
    return false;
}

$xp->registerPHPFunctions('hasClass');


// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.

$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;

$idNodeList = $xp->query($labelQuery);

$links = [];

// For each div node, a new link node is created in the associative array $links.
// The keys are labels. 
foreach($idNodeList as $divNode) {

    // The pattern extract the first text part in group 1 and the label in group 2
    if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
        $links[$m[2]] = $dom->createElement('a');
        $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
        $links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
        $links[$m[2]]->nodeValue = $m[1];
    }
}


if ($links) { // if $links is empty no need to do anything

    $refNodeList = $xp->query("//text()[contains(., '[ref*')]");

    foreach ($refNodeList as $refNode) {
        // split the text with square brackets parts, the reference name is preserved in a capture
        $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

        // create a fragment to receive text parts and links
        $frag = $dom->createDocumentFragment();

        foreach ($parts as $k=>$part) {
            if ($k%2 && isset($links[$part])) { // delimiters are always odd items
                $clone = $links[$part]->cloneNode(true);
                $frag->appendChild($clone);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }

        $refNode->parentNode->replaceChild($frag, $refNode);
    }
}

$result = '';

$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

foreach ($childNodes as $childNode) {
    $result .= $dom->saveXML($childNode);
}

echo $result;
like image 117
Casimir et Hippolyte Avatar answered Oct 17 '22 06:10

Casimir et Hippolyte