
Wikipedia first paragraph

I'm writing some Java code to perform NLP tasks on text taken from Wikipedia. How can I use jsoup to extract the first paragraph of a Wikipedia article?

Thanks a lot.

Lida asked Dec 21 '22 05:12


1 Answer

It is very simple, and the process is quite similar for every semi-structured page from which you are extracting information.

First, you have to uniquely identify the DOM element that contains the information you need. The easiest way to do this is with a browser's web development tools, such as Firebug in Firefox, or the ones that come bundled with IE (> 6, I think) and Chrome.

Using the article Potato as an example, you will find that the paragraph (<p>) you are interested in sits in the following block:

<div class="mw-content-ltr" lang="en" dir="ltr">
  <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
  <div class="dablink">[...]</div>
  <div class="dablink">[...]</div>
  <div>[...]</div>
  <p>The potato [...]</p>
  <p>[...]</p>
  <p>[...]</p>
  [...]
</div>

In other words, you want to find the first <p> element that is inside the div with a class called mw-content-ltr.
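To make that selection concrete without touching the network, here is a minimal sketch that runs jsoup over an inline HTML string. The markup is a cut-down, made-up stand-in for the block above, and the class name SelectorSketch and the helper firstMatch are hypothetical. It also shows one thing worth checking on a real page: a descendant selector (".mw-content-ltr p") matches a <p> nested in any inner block, while the child combinator (".mw-content-ltr > p") only matches direct children.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {
  // Made-up stand-in for the article markup; not real Wikipedia HTML.
  static final String HTML =
      "<div class=\"mw-content-ltr\" lang=\"en\" dir=\"ltr\">"
    + "<div class=\"dablink\">For other uses, see ...</div>"
    + "<div><p>Text nested in another block</p></div>"
    + "<p>The potato is a starchy tuber.</p>"
    + "<p>Second paragraph.</p>"
    + "</div>";

  // Returns the text of the first element matching the CSS query, or null.
  static String firstMatch(String cssQuery) {
    Document doc = Jsoup.parse(HTML);
    Element el = doc.select(cssQuery).first();
    return el == null ? null : el.text();
  }

  public static void main(String[] args) {
    // Descendant selector: picks up the <p> nested in the inner <div> first.
    System.out.println(firstMatch(".mw-content-ltr p"));
    // Child combinator: only <p> elements directly under the content div.
    System.out.println(firstMatch(".mw-content-ltr > p"));
  }
}
```

Which of the two selectors is right depends on the markup of the page you are scraping, so it is worth inspecting the article in the browser tools first.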

Then, you just need to select that element with jsoup, for example using its selector syntax (which is very similar to jQuery's):

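import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
  private final String baseUrl;

  public WikipediaParser(String lang) {
    // Per-language base URL, e.g. https://en.wikipedia.org/wiki/
    this.baseUrl = String.format("https://%s.wikipedia.org/wiki/", lang);
  }

  public String fetchFirstParagraph(String article) throws IOException {
    String url = baseUrl + article;
    Document doc = Jsoup.connect(url).get();
    // Every <p> nested inside the element with class mw-content-ltr
    Elements paragraphs = doc.select(".mw-content-ltr p");

    // first() returns null when nothing matched, so guard before calling text()
    Element firstParagraph = paragraphs.first();
    return firstParagraph == null ? "" : firstParagraph.text();
  }

  public static void main(String[] args) throws IOException {
    WikipediaParser parser = new WikipediaParser("en");
    String firstParagraph = parser.fetchFirstParagraph("Potato");
    System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
  }
}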
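
One thing the parser above leaves open is what happens when the article title contains spaces or non-ASCII characters, since it is appended to the base URL as-is. A minimal sketch of one way to build the URL, assuming you want underscores for spaces and UTF-8 percent-encoding for the rest; ArticleUrlSketch and articleUrl are hypothetical names, not part of jsoup or of the class above, and URLEncoder.encode(String, Charset) needs Java 10 or later:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ArticleUrlSketch {
  // Hypothetical helper; builds an article URL from a language code and a title.
  static String articleUrl(String lang, String title) {
    // Wikipedia page URLs use underscores where the title has spaces.
    String path = title.trim().replace(' ', '_');
    // Percent-encode the remaining characters outside URLEncoder's safe set.
    String encoded = URLEncoder.encode(path, StandardCharsets.UTF_8);
    return "https://" + lang + ".wikipedia.org/wiki/" + encoded;
  }

  public static void main(String[] args) {
    System.out.println(articleUrl("en", "Potato"));
    // prints "https://en.wikipedia.org/wiki/Potato"
    System.out.println(articleUrl("en", "Sweet potato"));
    // prints "https://en.wikipedia.org/wiki/Sweet_potato"
    System.out.println(articleUrl("pt", "S\u00e3o Paulo"));
    // prints "https://pt.wikipedia.org/wiki/S%C3%A3o_Paulo"
  }
}
```

The result can then be passed to Jsoup.connect(...) in place of baseUrl + article. Note that URLEncoder targets form encoding, so it also escapes characters such as "(" and "/"; treat this as a starting point and check it against the titles you actually need.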
João Silva answered Jan 06 '23 19:01