Summarizing a Wikipedia Article

I find myself having to learn new things all the time, and I've been trying to think of ways to expedite the process of learning new subjects. I thought it might be neat if I could write a program to parse a Wikipedia article and remove everything but the most valuable information.

I started by taking the Wikipedia article on PDFs and extracting the first 100 sentences. I gave each sentence a score based on how valuable I thought it was. I ended up creating a file following this format:

<sentence>
<value>
<sentence>
<value>
etc.
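As a minimal sketch of reading a file in this format, assuming each sentence sits on one line and is followed by a numeric score on the next line: the function name `parse_scored_sentences` and the sample text are made up for illustration and are not taken from the linked repository.

```python
def parse_scored_sentences(text):
    """Parse alternating sentence/value lines into (sentence, score) pairs."""
    # Drop blank lines so stray empty lines do not shift the pairing.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")
    pairs = []
    for i in range(0, len(lines), 2):
        sentence = lines[i]
        score = float(lines[i + 1])  # raises ValueError if the value is not numeric
        pairs.append((sentence, score))
    return pairs


if __name__ == "__main__":
    # Invented sample data, only to show the expected layout.
    sample = "This is the first sentence.\n3\nThis is the second sentence.\n1\n"
    for sentence, score in parse_scored_sentences(sample):
        print(score, sentence)
```

Failing loudly on an odd number of lines is a deliberate choice here: a single missing score would otherwise silently pair every later sentence with the wrong value.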

I then parsed this file and attempted to find various functions that would correlate each sentence with the value I had given it. I've just begun learning about machine learning and statistics and whatnot, so I'm doing a lot of fumbling around here. This is my latest attempt: https://github.com/JesseAldridge/Wikipedia-Summarizer/blob/master/plot_sentences.py.

I tried a bunch of stuff that didn't seem to produce much of any correlation at all -- average word length, position in the article, etc. Pretty much the only thing that produced any sort of useful relationship was the length of the string (more specifically, counting the number of lowercase letter 'e's seemed to work best). But that seems kind of lame, because it seems obvious that longer sentences would be more likely to contain useful information.
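One way to compare candidate features like these is to compute the Pearson correlation between each feature and the hand-assigned scores. The sketch below assumes the data is already available as a list of `(sentence, score)` pairs; the names `pearson`, `FEATURES`, and `feature_correlations` are hypothetical and not from the linked script.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence has no defined correlation; report 0
    return cov / math.sqrt(var_x * var_y)


def avg_word_length(sentence):
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


# Candidate features mentioned in the text: each maps a sentence to a number.
FEATURES = {
    "length": len,
    "count_e": lambda s: s.count("e"),
    "avg_word_length": avg_word_length,
}


def feature_correlations(pairs):
    """Return {feature name: correlation with the scores} for (sentence, score) pairs."""
    scores = [score for _, score in pairs]
    return {
        name: pearson([feature(sentence) for sentence, _ in pairs], scores)
        for name, feature in FEATURES.items()
    }
```

Note that `length` and `count_e` will usually be strongly correlated with each other, so a high value for one says little beyond what the other already shows.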

At one point I thought I had found some interesting functions, but then when I tried removing outliers (by only counting the inner quartiles), they turned out to produce worse results than simply returning 0 for every sentence. This got me wondering about how many other things I might be doing wrong... I'm also wondering whether this is even a good way to be approaching this problem.
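A comparison against trivial baselines like "always return 0" can be made explicit with a small helper. This is only a sketch under assumptions: it uses mean squared error as the measure (the error measure in the linked script may differ), it expects a list of `(sentence, score)` pairs, and `compare_to_baselines` is a hypothetical name.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def compare_to_baselines(predict, pairs):
    """Score a predictor against two trivial baselines on (sentence, score) pairs.

    `predict` is any function mapping a sentence to a predicted score.
    """
    sentences = [sentence for sentence, _ in pairs]
    scores = [score for _, score in pairs]
    mean_score = sum(scores) / len(scores)
    return {
        "model": mean_squared_error([predict(s) for s in sentences], scores),
        "always_zero": mean_squared_error([0.0] * len(scores), scores),
        "always_mean": mean_squared_error([mean_score] * len(scores), scores),
    }
```

A candidate function is only interesting if its error is clearly below both baselines. Predicting the mean score is usually the harder of the two to beat, so it is worth checking alongside the constant-zero case.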

Do you think I'm on the right track? Or is this just a fool's errand? Are there any glaring deficiencies in the linked code? Does anyone know of a better way to approach the problem of summarizing a Wikipedia article? I'd rather have a quick and dirty solution than something perfect that takes a long time to put together. Any general advice would also be welcome.

Jesse Aldridge asked Jan 01 '12 02:01



1 Answer

Considering that your question relates more to a research activity than a programming problem, you should probably look at the scientific literature. There you will find published details of a number of algorithms that do exactly what you want. A Google search for "keyword summarization" finds the following:

Single document Summarization based on Clustering Coefficient and Transitivity Analysis

Multi-document Summarization for Query Answering E-learning System

Intelligent Email: Aiding Users with AI

If you read the above and then follow the references they contain, you will find a wealth of information, certainly enough to build a functional application.

ColinE answered Oct 14 '22 00:10