
How to extract text between <p> tags

I want to extract the text inside the p and li tags of HTML pages, so that I can tokenize each page and build an inverted index per page to answer search queries.

How can I get the p tags using jsoup?

Elements e = doc.select(""); 

What string should I pass as that parameter?

asked May 23 '13 at 11:05 by rena-c




1 Answer

This should do the job:

Elements e=doc.select("p"); 

Here is a list of all selectors you can use.
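
Since the question asks about both p and li tags, a comma-separated selector group can match the two in one call. The following is only a minimal sketch, assuming jsoup is on the classpath; the class name `ParagraphAndListText`, the helper `extractTexts`, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphAndListText {

    // Returns the text of every <p> and <li> element, in document order.
    // (Hypothetical helper, not part of jsoup.)
    static List<String> extractTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // "p, li" is a selector group: it matches elements that are <p> or <li>.
        for (Element el : doc.select("p, li")) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample markup invented for this example.
        String html = "<p>first paragraph</p><ul><li>item one</li><li>item <b>two</b></li></ul>";
        for (String t : extractTexts(html)) {
            System.out.println(t);
        }
    }
}
```

Keeping one string per element, rather than concatenating everything, makes it easier to tokenize each block separately later.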

Suppose you have this html:

String html="<p>some <strong>bold</strong> text</p>";

To get "some bold text" as the result, use:

Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // "some bold text" (text of the whole body)

or

String text = p.text(); // "some bold text" (text of the first p only)

Now suppose you have the following, more complex HTML:

String html="<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";

To get the text of the two p tags, you can do something like this:

Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");

String pConcatenated = "";
for (Element x : p) {
  pConcatenated += x.text(); // text() trims each value; no separator is added here
}

System.out.println(pConcatenated); // some textanother p tag

You can find more info here also
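
As for the tokenizing and inverted index mentioned in the question, here is one possible sketch using only the Java standard library. It assumes the page text has already been extracted (for example with text() as above); the class name `InvertedIndexSketch`, the page ids, and the simple "split on non-letters and non-digits" tokenizer are assumptions made for illustration, not something jsoup provides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Splits text into lowercase tokens on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Maps each token to the sorted set of page ids whose text contains it.
    static Map<String, Set<String>> buildIndex(Map<String, String> pageTexts) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pageTexts.entrySet()) {
            for (String token : tokenize(page.getValue())) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Page ids and texts are placeholders; in practice the texts would come from jsoup.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("page1", "some bold text");
        pages.put("page2", "some other text, another p tag");
        System.out.println(buildIndex(pages));
    }
}
```

A query term can then be looked up with index.get(term) after passing it through the same tokenize step.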

Hope this helped

answered Oct 12 '22 at 23:10 by MaVRoSCy