I want to extract the text inside the p and li tags of HTML pages, so that I can tokenize each page and build an inverted index per page in order to answer search queries.
How can I get the p tags using jsoup?
Elements e = doc.select("");
What string should be passed as that parameter?
In PHP, the preg_match() function with a regular expression is a common way to extract the text between HTML tags. The same approach can also extract the content of an element based on its class name or ID.
The recommended way to extract information from a markup language is to use a parser; Beautiful Soup, for instance, is a good choice. Avoid using regular expressions for this, they are not the right tool for the job! So probably something along the lines of var.findall(text=True)?
If you want to find all HTML elements that match a specified CSS selector (id, class names, types, attributes, values of attributes, etc.), use the querySelectorAll() method. For example, document.querySelectorAll("p.intro") returns a list of all <p> elements with class="intro".
The idea of the <p></p> tag is to display a paragraph, whereas HTML offers <div></div> as a general container concept. So you should use Salman A's solution, because the different tags in HTML exist for a reason.
This should do the job:
Elements e = doc.select("p");
To match the li tags as well, combine the two selectors with a comma: doc.select("p, li").
Here is a list of all selectors you can use.
Suppose you have this HTML:
String html = "<p>some <strong>bold</strong> text</p>";
To get "some bold text" as the result, use:
Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text
or
String text = p.text(); // some bold text
Suppose now you have the following, more complex HTML:
String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
To get the text of the two p tags, you can do something like this:
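Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");
StringBuilder pConcatenated = new StringBuilder();
for (Element x : p) {
    // text() returns trimmed text, so add a separator between paragraphs
    pConcatenated.append(x.text()).append(' ');
}
System.out.println(pConcatenated.toString().trim()); // some text another p tag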
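Since the question mentions tokenizing the extracted text to build an inverted index, here is a minimal sketch of that step using only the Java standard library. It assumes the text of each page has already been extracted with jsoup as shown above. The class name InvertedIndexSketch, the method names, and the page ids page1 and page2 are made up for illustration and are not part of jsoup; the tokenizer is deliberately naive (lowercase, split on anything that is not a letter or digit).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Split text into lowercase tokens; any run of characters that is
    // not a letter or digit is treated as a separator.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Map each token to the sorted set of page ids whose text contains it.
    // The keys of pageTexts are hypothetical page ids, the values are the
    // texts extracted from the p and li tags of that page.
    public static Map<String, Set<String>> buildIndex(Map<String, String> pageTexts) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pageTexts.entrySet()) {
            for (String token : tokenize(page.getValue())) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pageTexts = new LinkedHashMap<>();
        pageTexts.put("page1", "some bold text");
        pageTexts.put("page2", "another p tag with some text");

        Map<String, Set<String>> index = buildIndex(pageTexts);
        System.out.println(index.get("some")); // [page1, page2]
        System.out.println(index.get("bold")); // [page1]
    }
}
```

A single-word query is then just a lookup in the map. A real search index would probably also want stop-word removal, stemming, and term positions or frequencies, which this sketch leaves out.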
You can also find more info here.
Hope this helped