I find myself having to learn new things all the time. I've been trying to think of ways I could expedite the process of learning new subjects. I thought it might be neat if I could write a program to parse a Wikipedia article and remove everything but the most valuable information.
I started by taking the Wikipedia article on PDFs and extracting the first 100 sentences. I gave each sentence a score based on how valuable I thought it was. I ended up creating a file following this format:
<sentence>
<value>
<sentence>
<value>
etc.
I then parsed this file and attempted to find various functions that would correlate each sentence with the value I had given it. I've just begun learning about machine learning and statistics and whatnot, so I'm doing a lot of fumbling around here. This is my latest attempt: https://github.com/JesseAldridge/Wikipedia-Summarizer/blob/master/plot_sentences.py.
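For reference, a minimal sketch of parsing that alternating sentence/value format might look like this (the function name and score type are my assumptions, not taken from the linked code):

```python
def parse_scored_sentences(text):
    """Parse alternating sentence/value lines into (sentence, score) pairs."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    pairs = []
    # Lines alternate: a sentence, then the numeric score assigned to it.
    for sentence, value in zip(lines[0::2], lines[1::2]):
        pairs.append((sentence, float(value)))
    return pairs
```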
I tried a bunch of stuff that didn't seem to produce much of any correlation at all -- average word length, position in the article, etc. Pretty much the only thing that produced any sort of useful relationship was the length of the string (more specifically, counting the number of lowercase letter 'e's seemed to work best). But that seems kind of lame, because it seems obvious that longer sentences would be more likely to contain useful information.
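One way to put a number on "seemed to work best" is Pearson's r between a candidate feature and the hand-assigned scores. A standard-library-only sketch (the variable names `sentences` and `scores` in the comment are hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. correlate the 'e' count of each sentence with its hand-assigned score:
# r = pearson_r([s.count('e') for s in sentences], scores)
```

A value near +1 or -1 indicates a strong linear relationship; near 0, essentially none, which makes it easy to compare features like word length, position, and string length on equal footing.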
At one point I thought I had found some interesting functions, but then when I tried removing outliers (by only counting the inner quartiles), they turned out to produce worse results than simply returning 0 for every sentence. This got me wondering about how many other things I might be doing wrong... I'm also wondering whether this is even a good way to be approaching this problem.
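Quartile trimming is easy to get subtly wrong, so for comparison, here is one straightforward way to do it: keep only the points whose feature value falls between the 25th and 75th percentiles (this is a sketch of the general technique, not the linked code's exact method):

```python
def inner_quartiles(pairs):
    """Keep only (feature, score) pairs whose feature value is in the middle 50%."""
    values = sorted(f for f, _ in pairs)
    q1 = values[len(values) // 4]        # approximate 25th percentile
    q3 = values[(3 * len(values)) // 4]  # approximate 75th percentile
    return [(f, s) for f, s in pairs if q1 <= f <= q3]
```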
Do you think I'm on the right track? Or is this just a fool's errand? Are there any glaring deficiencies in the linked code? Does anyone know of a better way to approach the problem of summarizing a Wikipedia article? I'd rather have a quick and dirty solution than something perfect that takes a long time to put together. Any general advice would also be welcome.
Considering that your question relates more to a research activity than a programming problem, you should probably look at the scientific literature. There you will find published details of a number of algorithms that do exactly what you want. A Google search for "keyword summarization" finds the following:
Single document Summarization based on Clustering Coefficient and Transitivity Analysis
Multi-document Summarization for Query Answering E-learning System
Intelligent Email: Aiding Users with AI
If you read the above, then follow the references they contain, you will find a wealth of information -- certainly enough to build a functional application.
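If you want the quick-and-dirty version before diving into those papers, a classic baseline is frequency-based extractive summarization: score each sentence by the average corpus frequency of its content words and keep the top few. A minimal sketch (the stopword list and `summarize` function are my own illustrative choices):

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "it", "of", "and", "to", "in", "was", "by", "are"}

def summarize(text, n=3):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOPWORDS]
        # Average frequency, so long sentences aren't automatically favored.
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]
```

Averaging rather than summing the frequencies addresses the "longer sentences score higher" bias mentioned in the question.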