How to select all the text between 2 tags?

What is the best way to select all the text between two tags, for example the text between all the <pre> tags on a page?

The best way is to use an XML/HTML parser, such as Beautiful Soup if you work in Python. In general, using regular expressions to parse HTML is not a good idea: stackoverflow.com/questions/1732348/…

Does regex match open HTML tags?

Everything you ever needed to know about parsing HTML with a regular expression: RegEx match open tags except XHTML self-contained tags. – RobG Sep 7 2015 at 10:02

How do you use regex to select text between two characters?

Regex can be used to select everything between two specified characters. This is useful for extracting the contents of parentheses, as in (abc), or folder names from a file path (e.g. C:/documents/work/). A regular expression that matches all characters between two specified characters makes use of look-behind (?<=…) and look-ahead (?=…).
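
As a minimal sketch of the idea above, here is how the two cases mentioned (parentheses and folder names) might look in Python's standard `re` module. The sample strings and variable names are assumptions made up for illustration.

```python
import re

# Hypothetical sample inputs, chosen only for illustration.
text = "call(abc) and then (def)"
path = "C:/documents/work/"

# (?<=\() requires a "(" just before the match; (?=\)) requires a ")" just after.
in_parens = re.findall(r"(?<=\()[^)]*(?=\))", text)

# Same idea with "/" as both delimiters: every folder name between two slashes.
folders = re.findall(r"(?<=/)[^/]+(?=/)", path)

print(in_parens)  # ['abc', 'def']
print(folders)    # ['documents', 'work']
```

Because the lookarounds do not consume the delimiters, the slash that ends "documents" is still available to start "work".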

How to use regex to match everything between two strings using Java?

To match everything between two strings in Java, use Pattern and Matcher objects with the lazy group (.*?). Since a Matcher may find more than one match, loop over the results and store each one:

while (m.find()) {              // loop through all matches
    results.add(m.group());     // get the value and store it in the collection
}

Regex select all text between tags

Tags: html, regex, html-parsing

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (for more specific instructions, specify a language), but this assumes that you have very simple and valid HTML.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
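
Since no language was specified, here is a rough sketch in Python of "extract the first group". The HTML fragment is a made-up example, and the greedy variant is included only to show why the `?` matters.

```python
import re

# Hypothetical page fragment used only for this example.
html = "<p>intro</p><pre>first block</pre><p>text</p><pre>second block</pre>"

# The lazy group (.*?) stops at the nearest </pre>, so each block is matched separately.
blocks = re.findall(r"<pre>(.*?)</pre>", html)

# Without the "?", the greedy group runs to the LAST </pre> and swallows everything between.
greedy = re.findall(r"<pre>(.*)</pre>", html)

print(blocks)  # ['first block', 'second block']
print(greedy)  # ['first block</pre><p>text</p><pre>second block']
```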

The content of a tag can continue onto another line. This is why \n needs to be added to the pattern:

<PRE>(.|\n)*?<\/PRE>
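
One caveat when using this pattern programmatically, sketched here in Python: a repeated capturing group such as `(.|\n)*?` keeps only its last iteration, so an outer group is needed to get the whole body. The sample string is an assumption for illustration.

```python
import re

html = "<PRE>line one\nline two</PRE>"

# The pattern as written: group 1 is overwritten on every repetition,
# so it ends up holding only the final character of the body.
last_only = re.search(r"<PRE>(.|\n)*?</PRE>", html).group(1)

# Non-capturing alternation inside one outer group captures the whole body.
body = re.search(r"<PRE>((?:.|\n)*?)</PRE>", html).group(1)

# A common alternative: [\s\S] matches any character, newlines included.
same = re.search(r"<PRE>([\s\S]*?)</PRE>", html).group(1)

print(repr(last_only))  # 'o'
print(repr(body))       # 'line one\nline two'
```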

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) The selection has to be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is the set of characters I want to allow. In this case, it selects a letter, a digit, a newline character, one of the special characters listed in the square brackets, or a space. The pipe character | simply means "OR".

+? The plus sign selects one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'ungreedy' (lazy).

(?=(</pre>)) The selection has to be followed by the </pre> tag.


Depending on your use case you might need to add some modifiers (such as i, m or s):

i - case-insensitive
m - multi-line search (^ and $ match at every line break)
s - single-line mode (the dot also matches newlines)

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
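
Outside an editor the modifiers usually have to be passed explicitly. The sketch below uses Python's `re` flags as stand-ins for the modifiers; the sample input is an assumption. It also illustrates that the multi-line flag alone does not let the dot cross line breaks.

```python
import re

# Hypothetical input: upper-case tags and content spread over several lines.
html = "<PRE>\nHello\nWorld\n</PRE>"
pattern = r"<pre>(.*?)</pre>"

no_flags = re.findall(pattern, html)                                    # case mismatch
ci_multiline = re.findall(pattern, html, re.IGNORECASE | re.MULTILINE)  # dot still stops at \n
ci_dotall = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)        # dot spans lines

print(no_flags)      # []
print(ci_multiline)  # []
print(ci_dotall)     # ['\nHello\nWorld\n']
```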

JavaScript and lookbehind

The above example should work fine with languages such as PHP, Perl, Java ...
JavaScript engines that predate ES2018, however, do not support lookbehind, so there we have to forget about using `(?<=(<pre>))` and look for some kind of workaround. Perhaps simply strip the opening tag from the start of each match, as in https://stackoverflow.com/questions/11592033/regex-match-text-between-tags
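
The stripping workaround is not tied to any one language. As a rough sketch, written in Python only to stay consistent with the other examples here, it could look like this (the names `OPEN`, `CLOSE` and the sample string are illustrative assumptions):

```python
import re

OPEN, CLOSE = "<pre>", "</pre>"
html = "<pre>alpha</pre> and <pre>beta</pre>"

# Match the tags together with the content, so no lookbehind is needed ...
whole = re.findall(r"<pre>.*?</pre>", html)

# ... then cut the delimiters off each match by their length.
inner = [m[len(OPEN):-len(CLOSE)] for m in whole]

print(whole)  # ['<pre>alpha</pre>', '<pre>beta</pre>']
print(inner)  # ['alpha', 'beta']
```

Using the tag lengths instead of a hard-coded character count keeps the slice correct if the tag name changes.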

Also see the JavaScript regex documentation on non-capturing parentheses.

To exclude the delimiting tags:

(?<=<pre>)(.*?)(?=</pre>)

(?<=<pre>) looks for text after <pre>

(?=</pre>) looks for text before </pre>

The result is the text inside the pre tag.
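
A small sketch of the lookaround pattern in Python, whose `re` module supports fixed-width lookbehind such as `(?<=<pre>)`. The input string is a made-up example.

```python
import re

html = "<pre>alpha</pre><p>skip</p><pre>beta</pre>"

# group(0) is the whole match; because the lookarounds consume nothing,
# the whole match already excludes the <pre> and </pre> delimiters.
matches = [m.group(0) for m in re.finditer(r"(?<=<pre>)(.*?)(?=</pre>)", html)]

print(matches)  # ['alpha', 'beta']
```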

Use the pattern below to get the content of an element. Replace [tag] with the actual element you wish to extract the content from.

<[tag]>(.+?)</[tag]>

Sometimes tags have attributes, such as an anchor tag with href. In that case use the pattern below.

<[tag][^>]*>(.+?)</[tag]>
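
One way to turn the [tag] placeholder into a reusable helper, sketched in Python. The function name `tag_contents` and the sample markup are hypothetical, and the added `\b` is my own assumption to stop a short tag name like `a` from also matching `<abbr>`.

```python
import re

def tag_contents(tag, html):
    """Return the content of every <tag ...>...</tag> pair (no nesting support)."""
    name = re.escape(tag)
    # \b ends the tag name; [^>]* skips over any attributes up to the closing ">".
    pattern = rf"<{name}\b[^>]*>(.+?)</{name}>"
    return re.findall(pattern, html, re.IGNORECASE | re.DOTALL)

# Hypothetical markup for illustration.
html = '<a href="/one">First</a> <abbr title="x">HTML</abbr> <a class="c" href="/two">Second</a>'
links = tag_contents("a", html)

print(links)  # ['First', 'Second']
```

Note that `[^>]*` assumes no attribute value contains a literal `>`, which is one more reason this only suits simple markup.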

This answer assumes support for lookaround. It allowed me to identify all the text between pairs of opening and closing tags, that is, all the text between the '>' and the '<'. It works because lookaround doesn't consume the characters it matches.

(?<=>)([\w\s]+)(?=<\/)

I tested it in https://regex101.com/ using this HTML fragment.

<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>

It's a game of three parts: the look behind, the content, and the look ahead.

(?<=>)    # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/)   # look ahead  (but don't consume/capture) for a '</'
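
To try the same pattern outside regex101, here is a sketch in Python run against the table fragment above. One detail worth knowing: the newline just before `</table>` also satisfies the pattern, so the sketch filters out whitespace-only matches.

```python
import re

html = """<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>"""

raw = re.findall(r"(?<=>)([\w\s]+)(?=</)", html)

# Drop matches that consist of whitespace only (the "\n" before </table>).
cells = [text for text in raw if text.strip()]

print(cells)  # ['Cell 1', 'Cell 2', 'Cell 3', 'Cell 4', 'Cell 5', 'Cell 6']
```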


I hope that serves as a starter for ten. Good luck.

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as nothing between the opening tag and its closing tag is that same tag, this will work:

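// Single quotes keep \1 as a regex backreference; in double quotes PHP reads "\1" as an octal escape.
preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
// $matches = array ( [0] => full matched string [1] => tag name [2] => tag content )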
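
The same backreference pattern can be tried in Python to see both the working case and the nesting limitation described above. This is only a sketch; the name `TAG_PAIR` and the sample strings are assumptions.

```python
import re

# \1 must repeat whatever tag name group 1 captured in the opening tag.
TAG_PAIR = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

flat = '<b>bold</b> and <i class="x">italic</i>'
nested = "<div>a<div>b</div>c</div>"

flat_result = TAG_PAIR.findall(flat)      # (tag name, content) pairs
nested_result = TAG_PAIR.findall(nested)  # stops at the FIRST </div>

print(flat_result)    # [('b', 'bold'), ('i', 'italic')]
print(nested_result)  # [('div', 'a<div>b')]
```

The nested case shows the problem: the outer div is paired with the inner closing tag, so the content is cut short.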

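A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and get its text content, which might look something like this:

$doc = new DOMDocument();
$doc->loadHTML($html);                       // parse an HTML string (load() expects a file path)
$nodes = $doc->getElementsByTagName('el');   // DOMNodeList of all matching elements
$value = $nodes->item(0)->nodeValue;         // text content of the first match (nodeValue is a property)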

And since this is a proper parser, it will be able to handle nested tags etc.
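
For comparison, a parser-based sketch in Python using only the standard library's `html.parser`. The class name `TagTextCollector` and the input are hypothetical; the depth counter is one simple way, under these assumptions, to keep nested tags of the same name together.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside every outermost <tag>...</tag> element."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.parts = []     # text pieces of the element being read
        self.results = []   # finished text of each outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

collector = TagTextCollector("div")
collector.feed("<p>skip</p><div>a<div>b</div>c</div><div>d</div>")
collector.close()

print(collector.results)  # ['abc', 'd']
```

Unlike the backreference regex, the nested div is read as part of its parent rather than ending the match early.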
