
Get text without tags from web page using Jsoup

Tags:

java

jsoup

I have to extract some data from a web page using Jsoup.

I have easily extracted the data contained in tags, but I still need some data which is not tagged.

This is an example of the HTML source:

<a id="aId" href="aLink" style="aStyle">
    <span id="spanId1">
        <b>Caldan Therapeutics</b> 
        Announces Key Appointments And A Collaboration With 
        <b>Sygnature Discovery</b>  
    </span>
    <span id="spanId2" style="spanStyle2">
        5/17/2016
    </span>
</a>

I have already extracted the data contained in the <b> tags as well as the date, but I now want to extract the sentence "Announces Key Appointments And A Collaboration With".

As you can see, this sentence has no tags.

How can I extract it?

I have already done my research and all I could find was how to strip all the tags out.

Thanks for your help!

user1885868 asked May 18 '16 08:05




1 Answer

I found an answer to this specific need and would like to share it with anyone who might face the same issue in the future.

All you need to do is use the method ownText(). It returns only the text that belongs directly to the element and excludes the text of its child tags.

In our example:

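import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main { // class name is arbitrary
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://source-url").get();
        Elements spanTags = doc.getElementsByTag("span");
        for (Element spanTag : spanTags) {
            // ownText() returns only the span's own text, skipping child tags such as <b>
            String text = spanTag.ownText();
            System.out.println(text);
        }
    }
}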
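
To try this without a network connection, here is a minimal sketch that parses the HTML from the question as a string instead of fetching it. The class name OwnTextDemo, the HTML constant, and the helper ownTextOf are hypothetical names chosen for illustration; only Jsoup.parse, getElementById, and ownText come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // The HTML from the question, condensed into a single string.
    static final String HTML =
        "<a id=\"aId\" href=\"aLink\" style=\"aStyle\">"
      + "<span id=\"spanId1\"><b>Caldan Therapeutics</b> "
      + "Announces Key Appointments And A Collaboration With "
      + "<b>Sygnature Discovery</b></span>"
      + "<span id=\"spanId2\" style=\"spanStyle2\">5/17/2016</span>"
      + "</a>";

    // Hypothetical helper: returns the untagged text of the element with the
    // given id, or an empty string when no such element exists.
    static String ownTextOf(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById(id);
        return element == null ? "" : element.ownText();
    }

    public static void main(String[] args) {
        // Prints the sentence between the two <b> tags.
        System.out.println(ownTextOf(HTML, "spanId1"));
        // Prints the date, since that span has no child tags.
        System.out.println(ownTextOf(HTML, "spanId2"));
    }
}
```

Selecting by id rather than by tag name lets you pick the exact span you need. Calling text() instead of ownText() on spanId1 would also include the two company names from the <b> tags.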
user1885868 answered Oct 17 '22 01:10
