Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to generate XPath query matching a specific element in Jsoup?

Tags:

_ Hi , this is my web page :

<html>
    <head>
    </head>
    <body>
        <div> text div 1</div>
        <div>
            <span>text of first span </span>
            <span>text of second span </span>
        </div>
        <div> text div 3 </div>
    </body>
</html>

I'm using jsoup to parse it , and then browse all elements inside the page and get their paths :

 Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
 Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
        for (Element element : elements) {
            if (!element.ownText().isEmpty()) {

                StringBuilder path = new StringBuilder(element.nodeName());
                String value = element.ownText();
                Elements p_el = element.parents();

                for (Element el : p_el) {
                    path.insert(0, el.nodeName() + '/');
                }
                all.add(path + " = " + value + "\n");
                System.out.println(path +" = "+ value);
            }
        }

        return all;

my code give me this result :

html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3

in fact i want get result like this :

html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3

please could any one give me idea how to get reach this result :) . thanks in advance.

like image 286
kivok94 Avatar asked Mar 26 '16 09:03

kivok94


People also ask

Can we use XPath in jsoup?

With XPath expressions it is able to select the elements within the HTML using Jsoup as HTML parser.

What is element in jsoup?

A HTML element consists of a tag name, attributes, and child nodes (including text nodes and other elements). From an Element, you can extract data, traverse the node graph, and manipulate the HTML.

What does jsoup clean do?

clean. Creates a new, clean document, from the original dirty document, containing only elements allowed by the safelist. The original document is not modified. Only elements from the dirty document's body are used.


1 Answers

As asked here a idea. Even if I'm quite sure that there better solutions to get the xpath for a given node. For example use xslt as in the answer to "Generate/get xpath from XML node java".

Here the possible solution based on your current attempt.

For each (parent) element check if there are more than one element with this name. Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true count the preceding-sibling:: with same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1

like image 173
hr_117 Avatar answered Nov 01 '22 11:11

hr_117