
How to automatically determine text quality?

A lot of Natural Language Processing (NLP) algorithms and libraries have a hard time working with random texts from the web, usually because they presuppose clean, articulate writing. I can understand why that would be easier than parsing YouTube comments.

My question is: given a random piece of text, is there a process to determine whether that text is well written and is a good candidate for use in NLP? What is the general name for these algorithms?

I would appreciate links to articles, algorithms or code libraries, but I would settle for good search terms.

asked Feb 15 '10 by itsadok

2 Answers

One easy thing to try would be to classify the text as well written or not using an n-gram language model. To do this, you would first train a language model on a collection of well-written text. Given a new piece of text, you could then run the model over it and only pass it on to other downstream NLP tools if the per-word perplexity is sufficiently low (i.e., if it looks sufficiently similar to the well-written training text).

To get the best results, you should probably train your n-gram language model on text that is similar to whatever was used to train the other NLP tools you're using. That is, if you're using a phrase structure parser trained on newswire, then you should also train your n-gram language model on newswire.

In terms of software toolkits you could use for something like this, SRILM would be a good place to start.
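As a rough illustration of the perplexity filter described above, here is a minimal sketch using NLTK's `nltk.lm` module rather than SRILM (my substitution, not something the answer prescribes). The names `clean_sentences` and `candidate_tokens` and the threshold of 200 are placeholders you would have to supply and tune on held-out data.

```python
# Minimal sketch: train an add-one-smoothed trigram model on well-written
# text, then accept a new text only if its per-word perplexity is low.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

N = 3  # trigram model

def train_lm(clean_sentences):
    """clean_sentences: list of tokenized, well-written training sentences."""
    train_ngrams, vocab = padded_everygram_pipeline(N, clean_sentences)
    lm = Laplace(N)
    lm.fit(train_ngrams, vocab)
    return lm

def looks_well_written(lm, candidate_tokens, threshold=200.0):
    """Return True if the candidate's per-word perplexity is below the threshold."""
    padded = list(pad_both_ends(candidate_tokens, n=N))
    ppl = lm.perplexity(ngrams(padded, N))
    return ppl <= threshold
```

SRILM would do the same job at scale; the NLTK version is just the quickest way to prototype the idea in a few lines.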

However, an alternative solution would be to try to adapt whatever NLP tools you're using to the text you want to process. One approach is self-training, whereby you run your NLP tools over the type of data you would like to process and then retrain them on their own output. For example, McClosky et al. (2006) used this technique to take a parser originally trained on the Wall Street Journal and adapt it to parsing biomedical text.
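To make the self-training idea concrete, here is a rough sketch assuming a hypothetical parser object with `parse` and `train` methods; the actual McClosky et al. experiments used a reranking parser with its own training pipeline, so treat this as pseudocode made runnable rather than a real recipe.

```python
# Sketch of a self-training loop: parse raw target-domain text with the
# current model, add the automatic parses to the training data, retrain.
def self_train(parser, labeled_treebank, unlabeled_sentences, rounds=1):
    training_data = list(labeled_treebank)
    for _ in range(rounds):
        # Parse the unannotated target-domain sentences with the current parser.
        auto_parses = [parser.parse(sent) for sent in unlabeled_sentences]
        # Treat the parser's own output as additional (noisy) training data.
        training_data.extend(auto_parses)
        # Retrain on the combined gold + automatically parsed data.
        parser = parser.train(training_data)
    return parser
```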

answered Sep 21 '22 by dmcer

'Well written' and 'good for NLP' may go together, but they don't have to. For a text to be 'good for NLP', it should probably contain whole sentences with a verb and a full stop at the end, and it should perhaps convey some meaning. For a text to be well written, it should also be well-structured, cohesive, and coherent, substitute pronouns for nouns correctly, etc. What you need depends on your application.

The chances of a sentence being properly processed by an NLP tool can often be estimated by some simple heuristics: Is it too long (more than 20 or 30 words, depending on the language)? Too short? Does it contain many weird characters? Does it contain URLs or email addresses? Does it have a main verb? Is it just a list of something? To my knowledge, there is no general name for this kind of filtering, nor any particular algorithm for it; it usually just falls under 'preprocessing'.
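As an illustration, here is a minimal sketch of such a heuristic filter. The regexes and thresholds are my own illustrative choices, and the main-verb check is omitted (it would need a POS tagger), so tune everything for your language and application.

```python
import re

# Heuristic pre-filter: reject sentences that are too short/long, contain
# URLs or email addresses, or have too many unusual characters.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
WEIRD_RE = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()-]")

def passes_heuristics(sentence, min_words=3, max_words=30, max_weird_ratio=0.05):
    tokens = sentence.split()
    if not (min_words <= len(tokens) <= max_words):
        return False  # too short or too long
    if URL_RE.search(sentence) or EMAIL_RE.search(sentence):
        return False  # contains URLs or email addresses
    weird = len(WEIRD_RE.findall(sentence))
    if weird / max(len(sentence), 1) > max_weird_ratio:
        return False  # too many weird characters
    return True
```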

As to a sentence being well written: some work has been done on automatically evaluating readability, cohesion, and coherence, e.g. the articles by Miltsakaki ("Evaluation of text coherence for electronic essay scoring systems" and "Real-time web text classification and analysis of reading difficulty") or Higgins ("Evaluating multiple aspects of coherence in student essays"). These approaches are all based on one theory of discourse structure or another, such as Centering Theory. The articles are rather theory-heavy and assume knowledge of both Centering Theory and machine learning. Nonetheless, some of these techniques have been successfully applied by ETS to automatically score students' essays, and I think this is quite similar to what you are trying to do, or at least you may be able to adapt a few ideas.
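The cited work relies on discourse models, which take real effort to implement. As a much simpler, self-contained illustration of automatic readability scoring, here is the classic Flesch Reading Ease formula with a crude vowel-group syllable counter; this is only a baseline, not the approach used in those papers. Higher scores indicate easier text.

```python
import re

def count_syllables(word):
    # Crude approximation: count contiguous vowel groups, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```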

All this being said, I believe that within the next few years NLP will have to develop techniques to process language that is not well-formed with respect to current standards. There is a massive amount of extremely valuable data out there on the web, consisting of exactly the kinds of text you mentioned: YouTube comments, chat messages, Twitter and Facebook status messages, etc. All of them potentially contain very interesting information. So, who should adapt: the people writing that way, or NLP?

answered Sep 22 '22 by ferdystschenko