Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is regex the right tool to find a line of HTML?

Tags:

html

regex

php

I have a PHP script that pulls some content off of a server, but the problem is that the line on which the content is changes every day, so I can't just pull a specific line. However, the content is contained within a div that has a unique id. Is it possible (and is it the best way) for regex to search for this unique id and then pass the line of which it's on back to my script?

Example:

HTML file:

<html><head><title>Example</title></head>
<body>
<div id="Alpha"> Blah blah blah </div>
<div id="Beta"> Blah Blah Blah </div>
</body>
</html>

So let's say that I'm looking for the line with an opening div tag with an id of alpha. The code should return 3, because on the third line is the div with the id of alpha.

like image 735
waiwai933 Avatar asked Dec 17 '22 04:12

waiwai933


2 Answers

At the risk of providing more up-votes for Jeff who has already crossed the mountains of madness... see here

The argument rages back and forth, but... it's is a simple one-off or little used script you are writing then sure use regex, if it's more complex and needs to be reliable with little future tweaking then I'd suggest using an HTML parser. HTML is a nasty often non-regular beast to tame. Use the right tool for the job... maybe in your case it's regex, or maybe its a full blown parser.

like image 116
beggs Avatar answered Jan 02 '23 23:01

beggs


Generally, NO. But if you are sure that the div will always be one line or there is not another div inside it, you can use it without problem. Something like /<div id=\"mydivid\">(.*?)</div>/ or something similar.

Otherwise, DOMDocument would be a more sane way.

EDIT See from your HTML example. My answer would be "YES". RegEx is a very good tool for this.

I assume that you have the HTML as a continuous text not as lines (which will be slightly different). I also assume that you want the line number more that the line content.

Here is a rought PHP code to extract it. (just to give some idea)

$HTML =
"<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";

$ID = "Alpha";

function GetLineOfDIV($HTML, $ID) {
    $RegEx_Alpha = '/\n(<div id="'.$ID.'">.*?<\/div>)\n/m';
    $Index       = preg_match($RegEx_Alpha, $HTML, $Match, PREG_OFFSET_CAPTURE);
    $Match       = $Match[1]; // Only the one in '(...)'
    if ($Match == "")
        return -1;

    //$MatchStr    = $Match[0]; Since you do not want it, so we comment it out.
    $MatchOffset = $Match[1];

    $StartLines = preg_split("/\n/", $HTML, -1, PREG_SPLIT_OFFSET_CAPTURE);
    foreach($StartLines as $I => $StartLine) {
        $LineOffset = $StartLine[1];
        if ($MatchOffset <= $LineOffset)
            return $I + 1;
    }
    return count($StartLines);
}

echo GetLineOfDIV($HTML, $ID);

I hope I give you some idea.

like image 45
NawaMan Avatar answered Jan 02 '23 21:01

NawaMan