I'm writing some Java code to perform NLP tasks on text from Wikipedia. How can I use JSoup to extract the first paragraph of a Wikipedia article?
Thanks a lot.
It is very simple, and the process is quite similar for every semi-structured page from which you are extracting information.
First, you have to uniquely identify the DOM element that contains the required information. The easiest way to do this is to use a web development tool, such as Firebug in Firefox, or the ones that come bundled with IE (> 6, I think) and Chrome.
Using the article Potato as an example, you will find that the paragraph (<p> element) you are interested in is in the following block:
<div class="mw-content-ltr" lang="en" dir="ltr">
<div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
<div class="dablink">[...]</div>
<div class="dablink">[...]</div>
<div>[...]</div>
<p>The potato [...]</p>
<p>[...]</p>
<p>[...]</p>
In other words, you want to find the first <p> element that is inside the div with a class called mw-content-ltr.
Then you just need to select that element with jsoup, for example using its selector syntax (which is very similar to jQuery's):
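import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
    private final String baseUrl;

    public WikipediaParser(String lang) {
        this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
    }

    // Downloads the article and returns the text of its first <p> element.
    public String fetchFirstParagraph(String article) throws IOException {
        String url = baseUrl + article;
        Document doc = Jsoup.connect(url).get();
        // All <p> elements inside the element with class "mw-content-ltr"
        Elements paragraphs = doc.select(".mw-content-ltr p");
        Element firstParagraph = paragraphs.first();
        // first() returns null when nothing matched
        return firstParagraph == null ? null : firstParagraph.text();
    }

    public static void main(String[] args) throws IOException {
        WikipediaParser parser = new WikipediaParser("en");
        String firstParagraph = parser.fetchFirstParagraph("Potato");
        System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
    }
}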
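If the page's markup changes, the first <p> that matches the selector may turn out to be empty. The following is only a sketch of one way to guard against that, not part of the original answer: it runs the same selector on an inline HTML string (so no network access is needed) and returns the first paragraph that actually contains text. The class name FirstParagraphSketch, the helper firstNonEmptyParagraph, and the sample markup are all hypothetical and chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class FirstParagraphSketch {

    // Returns the text of the first <p> inside .mw-content-ltr that is not
    // blank, or null if no such paragraph exists.
    static String firstNonEmptyParagraph(Document doc) {
        for (Element p : doc.select(".mw-content-ltr p")) {
            String text = p.text().trim();
            if (!text.isEmpty()) {
                return text;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily simplified markup; a real article is much larger.
        String html =
            "<div class=\"mw-content-ltr\" lang=\"en\" dir=\"ltr\">"
            + "<div class=\"dablink\">For other uses, see elsewhere.</div>"
            + "<p> </p>"
            + "<p>The potato is a starchy tuber.</p>"
            + "<p>Second paragraph.</p>"
            + "</div>";
        Document doc = Jsoup.parse(html);
        System.out.println(firstNonEmptyParagraph(doc)); // prints "The potato is a starchy tuber."
    }
}
```

To use this with a live page, pass the Document returned by Jsoup.connect(url).get() instead of the parsed string.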