I have to extract some data from a web page using Jsoup.
I have easily extracted the data contained in tags, but I still need some data which is not tagged.
This is an example of the HTML source:
<a id="aId" href="aLink" style="aStyle">
<span id="spanId1">
<b>Caldan Therapeutics</b>
Announces Key Appointments And A Collaboration With
<b>Sygnature Discovery</b>
</span>
<span id="spanId2" style="spanStyle2">
5/17/2016
</span>
</a>
I have already extracted the data contained in the <b> tags as well as the date, but what I want now is to extract the sentence "Announces Key Appointments And A Collaboration With".
As you can see, this sentence has no tags.
What can I do to extract it?
I have already done my research and all I could find was how to strip all the tags out.
Thanks for your help!
link.text(): the text() method retrieves an element's text. An Element object represents a DOM element and provides various methods for getting its text. Note that text() returns the combined text of the element and all of its children.
You can extract data by using CSS selectors, or by navigating and modifying the Document Object Model directly, just like a browser does, except you do it in Java code. You can also modify HTML and write it out safely. jsoup will not run JavaScript for you; if you need that in your app, I'd recommend looking at JCEF.
jsoup is packaged as a single jar with no other dependencies, so you can add it to any Java project so long as you're using Java 7 or later.
jsoup parses HTML into a document tree that works the same way as the DOM in a browser, offering methods similar to jQuery and vanilla JavaScript to select, traverse, manipulate text/HTML/attributes and add/remove elements. If you're comfortable with client-side selectors and DOM traversing/manipulation, you'll find jsoup very familiar.
I found an answer to that specific need and I would like to share it with anyone who might face the same issue in the future.
The solution is to use the ownText() method, which excludes the text of the element's child tags.
In our example:
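import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class OwnTextExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page ("http://source-url" is a placeholder)
        Document doc = Jsoup.connect("http://source-url").get();
        Elements spanTags = doc.getElementsByTag("span");
        for (Element spanTag : spanTags) {
            // ownText() returns only the span's direct text, not the text of its children
            String text = spanTag.ownText();
            System.out.println(text);
        }
    }
}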
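To see why this works, it may help to look at what "own text" means at the DOM level: only the text nodes that are direct children of the element are collected, and child elements such as <b> are skipped. The sketch below is not jsoup; it uses the JDK's built-in XML DOM parser, which works here only because a trimmed version of the snippet from the question happens to be well-formed XML. The class OwnTextSketch and the helpers ownText and firstSpan are hypothetical names made up for this illustration, and the whitespace handling only roughly approximates what jsoup does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OwnTextSketch {
    // Hypothetical helper: joins only the text nodes that are direct
    // children of el, skipping child elements such as <b>.
    static String ownText(Element el) {
        StringBuilder sb = new StringBuilder();
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        // Trim the ends and collapse runs of whitespace into single spaces.
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Hypothetical helper: parses well-formed markup and returns its first <span>.
    static Element firstSpan(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Element) doc.getElementsByTagName("span").item(0);
    }

    public static void main(String[] args) throws Exception {
        // Trimmed version of the snippet from the question.
        String xml = "<a id=\"aId\" href=\"aLink\"><span id=\"spanId1\">"
                + "<b>Caldan Therapeutics</b> Announces Key Appointments And A Collaboration With "
                + "<b>Sygnature Discovery</b></span></a>";
        Element span = firstSpan(xml);
        // All text, children included (comparable to jsoup's text()).
        System.out.println(span.getTextContent());
        // Direct text only (comparable to jsoup's ownText()).
        System.out.println(ownText(span));
    }
}
```

The first line printed contains the company names from the <b> tags as well; the second contains only the untagged sentence. With jsoup itself none of this is needed, since ownText() already does it for you.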