
Jsoup css selector code (xpath code included)

I am trying to parse the HTML below using jsoup, but I can't work out the right selector syntax for it.

<div class="info"><strong>Line 1:</strong> some text 1<br>
  <b>some text 2</b><br>
  <strong>Line 3:</strong> some text 3<br>
</div>

I need to capture some text 1, some text 2 and some text 3 in three different variables.

I have the XPath for the first line (line 3 should be similar), but I am unable to work out the equivalent CSS selector.

//div[@class='info']/strong[1]/following::text()

On a separate note, I have a few hundred HTML files and need to parse them and extract data to store in a database. Is Jsoup the best choice for this?

asked Feb 20 '23 15:02 by PTS Admin


1 Answer

It looks like Jsoup's CSS selectors can't easily get text out of an element with mixed content. Here is a solution that applies the XPath you formulated, using XOM and TagSoup:

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        // TagSoup's SAX parser reads the HTML; XOM builds a document tree from it.
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        // Bind the "xhtml" prefix to the root element's namespace so the XPath can match the parsed elements.
        final XPathContext context = new XPathContext("xhtml", root.getNamespaceURI());
        // following::text() selects every text node that comes after the first <strong>.
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", context);
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}

This outputs:

 some text 1
some text 2
Line 3:
 some text 3

Without knowing more specifics of what you're trying to do though, I'm not sure if this is exactly what you want.
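
If the goal is three separate variables, one option might be to narrow the XPath so that each expression picks out a single value instead of every following text node. The following is only a minimal sketch under some assumptions: the fragment has been rewritten as well-formed XML (each <br> closed as <br/>) so that the JDK's built-in javax.xml.xpath can read it without TagSoup, and the class name InfoTextSketch and the helper eval are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class InfoTextSketch {
    // Assumption: the fragment from the question, made well-formed by closing each <br/>.
    static final String XML =
            "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
            + "<b>some text 2</b><br/>"
            + "<strong>Line 3:</strong> some text 3<br/></div>";

    // Hypothetical helper: evaluates the expression against XML and returns the
    // trimmed string value of the first node it selects.
    static String eval(final String expression) {
        try {
            final XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(XML))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(final String[] args) {
        // The first text node that is a sibling after the first <strong>.
        final String text1 = eval("//div[@class='info']/strong[1]/following-sibling::text()[1]");
        // The text inside the <b> element.
        final String text2 = eval("//div[@class='info']/b");
        // The first text node that is a sibling after the second <strong>.
        final String text3 = eval("//div[@class='info']/strong[2]/following-sibling::text()[1]");

        System.out.println(text1); // some text 1
        System.out.println(text2); // some text 2
        System.out.println(text3); // some text 3
    }
}
```

The difference from the query above is the axis: following-sibling::text()[1] stops at the first text node next to the <strong>, whereas following::text() keeps going through the rest of the document. With the TagSoup and XOM setup the same expressions would presumably need the xhtml: prefix on each element name.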

answered Mar 02 '23 19:03 by laz