You can use "<pre>(.*?)</pre>"
, (replacing pre with whatever text you want) and extract the first group (for more specific instructions specify a language) but this assumes the simplistic notion that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use a HTML parser.
Tag can be completed in another line. This is why \n
needs to be added.
<PRE>(.|\n)*?<\/PRE>
(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))
Basically what it does is:
(?<=(<pre>))
Selection have to be prepend with <pre>
tag
(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|~]| )
This is just a regular expression I want to apply. In this case, it selects letter or digit or newline character or some special characters listed in the example in the square brackets. The pipe character |
simply means "OR".
+?
Plus character states to select one or more of the above - order does not matter. Question mark changes the default behavior from 'greedy' to 'ungreedy'.
(?=(</pre>))
Selection have to be appended by the </pre>
tag
Depending on your use case you might need to add some modifiers like (i or m)
Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
Also look at the JAVASCRIPT REGEX DOCUMENTATION for non-capturing parentheses
To exclude the delimiting tags:
(?<=<pre>)(.*?)(?=</pre>)
(?<=<pre>)
looks for text after <pre>
(?=</pre>)
looks for text before </pre>
Results will text inside pre
tag
use the below pattern to get content between element. Replace [tag]
with the actual element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometime tags will have attributes, like anchor
tag having href
, then use the below pattern.
<[tag][^>]*>(.+?)</[tag]>
This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.
(?<=>)([\w\s]+)(?=<\/)
I tested it in https://regex101.com/ using this HTML fragment.
<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>
It's a game of three parts: the look behind, the content, and the look ahead.
(?<=>) # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/) # look ahead (but don't consume/capture) for a '</'
I hope that serves as a started for 10. Luck.
You shouldn't be trying to parse html with regexes see this question and how it turned out.
In the simplest terms, html is not a regular language so you can't fully parse is with regular expressions.
Having said that you can parse subsets of html when there are no similar tags nested. So as long as anything between and is not that tag itself, this will work:
preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
A better idea is to use a parser, like the native DOMDocument, to load your html, then select your tag and get the inner html which might look something like this:
$obj = new DOMDocument();
$obj -> load($html);
$obj -> getElementByTagName('el');
$value = $obj -> nodeValue();
And since this is a proper parser it will be able to handle nesting tags etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With