Wikipedia first paragraph

Question

I'm writing some Java code to perform NLP tasks on texts from Wikipedia. How can I use jsoup to extract the first paragraph of a Wikipedia article?

Thanks a lot.

João Silva · Accepted Answer

It is very simple, and the process is quite similar for every semi-structured page from which you are extracting information.

First, you have to uniquely identify the DOM element that contains the required information. The easiest way to do this is to use a web development tool, such as Firebug in Firefox, or the ones that come bundled with IE (> 6, I think) and Chrome.

Using the article Potato as an example, you will find that the paragraph (<p>) you are interested in is in the following block:

<div class="mw-content-ltr" lang="en" dir="ltr">
  <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
  <div class="dablink">[...]</div>
  <div class="dablink">[...]</div>
  <div>[...]</div>
  <p>The potato [...]</p>
  <p>[...]</p>
  <p>[...]</p>

In other words, you want to find the first <p> element that is inside the div with a class called mw-content-ltr.

Then, you just need to select that element with jsoup, for example by using its selector syntax (which is very similar to jQuery's):

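import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
  private final String baseUrl;

  public WikipediaParser(String lang) {
    // Wikipedia is served over HTTPS; plain HTTP requests get redirected.
    this.baseUrl = String.format("https://%s.wikipedia.org/wiki/", lang);
  }

  public String fetchFirstParagraph(String article) throws IOException {
    String url = baseUrl + article;
    Document doc = Jsoup.connect(url).get();
    // Every <p> nested anywhere inside the element with class mw-content-ltr
    Elements paragraphs = doc.select(".mw-content-ltr p");

    // first() returns null when nothing matched (e.g. the page layout changed)
    Element firstParagraph = paragraphs.first();
    return firstParagraph == null ? "" : firstParagraph.text();
  }

  public static void main(String[] args) throws IOException {
    WikipediaParser parser = new WikipediaParser("en");
    String firstParagraph = parser.fetchFirstParagraph("Potato");
    System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
  }
}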
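
One detail the parser leaves open is how baseUrl + article behaves for titles that are not a single ASCII word. Wikipedia's own article URLs write spaces as underscores, so it is probably safer to normalize the title yourself than to rely on the HTTP client to fix it up. Here is a minimal sketch of that step using only the standard library. The class WikipediaUrls and the method articleUrl are hypothetical names made up for this example (they are not part of jsoup), and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of jsoup or the parser above.
public class WikipediaUrls {

  public static String articleUrl(String lang, String title) {
    // Wikipedia writes spaces in article titles as underscores.
    String normalized = title.trim().replace(' ', '_');
    // URLEncoder implements form encoding and would turn a space into '+',
    // but no spaces are left at this point. Underscores pass through unchanged.
    String encoded = URLEncoder.encode(normalized, StandardCharsets.UTF_8);
    return String.format("https://%s.wikipedia.org/wiki/%s", lang, encoded);
  }

  public static void main(String[] args) {
    // prints "https://en.wikipedia.org/wiki/Sweet_potato"
    System.out.println(articleUrl("en", "Sweet potato"));
    // prints "https://pt.wikipedia.org/wiki/S%C3%A3o_Paulo"
    System.out.println(articleUrl("pt", "S\u00e3o Paulo"));
  }
}
```

If you want to use it, fetchFirstParagraph could call WikipediaUrls.articleUrl(lang, article) instead of concatenating strings, which would mean keeping lang in a field rather than a prebuilt baseUrl.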

Tags: java, parsing, web-scraping, wikipedia, jsoup

Lida
