
Computing article abstracts

I'm looking for a way to automatically produce an abstract, basically the first few sentences or paragraphs of a blog entry, to display in a list of articles (which are written in Markdown). Currently, I'm doing something like this:

def abstract(article, paras=3):
    return '\n'.join(article.split('\n')[0:paras])

to just grab the first few lines' worth of text, but I'm not totally happy with the results.

What I'm really looking for is about 1/3 of a screenful of formatted text to display in the list of entries. With the algorithm above, the amount of text pulled varies wildly: abstracts of only a line or two are frequently mixed in with more ideally sized ones.

Is there a library that's good at this kind of thing? If not, do you have any suggestions to improve the output?

SingleNegationElimination asked Nov 04 '09 19:11



1 Answer

EDIT:

You can do something like this:

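from textwrap import wrap

def getAbstract(text, lines=5, screenwidth=100):
    # Wrap each source line to the screen width and keep the first `lines`
    # wrapped lines; their joined length approximates the abstract length.
    width = len(' '.join([
               line for block in text.splitlines()
               for line in wrap(block, width=screenwidth)
            ][:lines]))
    # Only add an ellipsis when the text was actually cut.
    return text[:width] + ('...' if width < len(text) else '')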

This uses the textwrap algorithm to estimate the ideal text length. It breaks the text into screen-width lines and uses the first few of them to work out how many characters the desired number of lines takes up.

For example, applying this algorithm to the Python Wikipedia entry:

print(getAbstract(text, lines=7))

will give you this output:

Python is a general-purpose high-level programming language.[2] Its design philosophy emphasizes code readability.[3] Python claims to "[combine] remarkable power with very clear syntax",[4] and its standard library is large and comprehensive. Its use of indentation as block delimiters is unusual among popular programming languages.

Python supports multiple programming paradigms (primarily object oriented, imperative, and functional) and features a fully dynamic type system and automatic memory management, similar to Perl, Ruby, Scheme, and Tcl. Like other dynamic languages, Python is often used as a scripting...
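
Note that the function above cuts at a character offset, so the abstract can end in the middle of a word. Here is a minimal sketch of a variant that budgets wrapped lines per paragraph and only cuts at line (and therefore word) boundaries. The name abstract_by_lines is made up for illustration, and it assumes paragraphs are separated by blank lines, as is usual in Markdown:

```python
from textwrap import wrap

def abstract_by_lines(text, lines=5, screenwidth=100):
    """Keep paragraphs until roughly `lines` wrapped lines are used.

    Hypothetical helper: assumes paragraphs are separated by blank lines.
    """
    kept, used = [], 0
    for para in text.split('\n\n'):
        wrapped = wrap(para, width=screenwidth)
        if not wrapped:
            continue  # skip empty paragraphs
        room = lines - used
        if room <= 0:
            break  # line budget already spent
        if len(wrapped) > room:
            # Paragraph does not fit: keep the lines that do and mark the cut.
            kept.append(' '.join(wrapped[:room]) + '...')
            break
        kept.append(' '.join(wrapped))
        used += len(wrapped)
    return '\n\n'.join(kept)
```

Because wrap() replaces newlines inside a paragraph with spaces, each kept paragraph comes back as a single line. That is usually harmless for Markdown, but it would flatten lists or code blocks, so treat this as a starting point rather than a finished solution.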


Without further details it's hard to help you. But if your problem is that taking the first few lines gives too much text for some entries, you may want to have a look at textwrap.

For example, if you only want abstracts of at most 100 characters, you can do the following:

import textwrap

abstract = textwrap.wrap(text, 100)[0]

That will also replace newlines with spaces, which might be desirable depending on your requirements.
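
If you are on Python 3.4 or later, textwrap.shorten does much the same thing in one call and appends a marker when it has to cut. A small sketch, where short_abstract is just an illustrative wrapper name:

```python
import textwrap

def short_abstract(text, width=100):
    # shorten() collapses runs of whitespace (including newlines) into single
    # spaces, then drops whole words from the end until the result, including
    # the placeholder, fits within `width` characters.
    return textwrap.shorten(text, width=width, placeholder='...')
```

Unlike wrap(text, 100)[0], this makes it visible that the text was truncated, and text that already fits is returned without the placeholder.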

Nadia Alramli answered Sep 24 '22 19:09