Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting the most fluent text from a set of possibilities via grammar checking (Python)

Some background

I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and Natural Language Processing knowledge come only from teaching myself things through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving forward in this project. Currently, I am entering the final phases of development, and have hit a little roadblock.

I need to implement some form of grammatical normalization, so that the output doesn't come out as un- conjugated/inflected caveman-speak. About a month ago some friendly folks on SO gave me some advice on how I might solve this issue by using an ngram language modeller, basically -- but I'm looking for yet other solutions, as it seems that NLTK's NgramModeler is not fit for my needs. (The possibilities of POS tagging were also mentioned, but my text may be too fragmentary and strange for an implementation of such to come easy, given my amateur-ness.)

Perhaps I need something like AtD, but hopefully less complex

I think need something that works like After the Deadline or Queequeg, but neither of these seem exactly right. Queequeg is probably not a good fit -- it was written in 2003 for Unix and I can't get it working on Windows for the life of me (have tried everything). But I like that all it checks for is proper verb conjugation and number agreement.

On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application is going to require of my web host is already threatening to cause problems in getting this application hosted cheaply.)

Things I'd like to avoid

Building Ngram language models myself doesn't seem right for the task. my application throws a lot of unknown vocabulary, skewing all the results. (Unless I use a corpus that's so large that it runs way too slow for my application -- the application needs to be pretty snappy.)

Strictly checking grammar is neither right for the task. the grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like jibberish that you can generate using ngrams. Even if it's jibberish, I just need to enforce verb conjugation, number agreement, and do things like remove extra articles.

In fact, I don't even need any kind of suggestions for corrections. I think all I need is for something to tally up how many errors seem to occur in each sentence in a group of possible sentences, so I can sort by their score and pick the one with the least grammatical issues.

A simple solution? Scoring fluency by detecting obvious errors

If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.

Let's say we have a tiny bit of text already laid out:

existing_text = "The old river"

Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors:

  1. Use the Verb Conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb; ['bear', 'bears', 'bearing', 'bore', 'borne'].
  2. Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from existing_text + " " + possibility ("The old river bear", "The old river bears", etc). Tally the error count for each construction. In this case the only construction to raise an error, seemingly, would be "The old river bear".
  3. Wrapping up should be easy... Of the possibilities with the lowest error count, select randomly.
like image 381
m_floer Avatar asked Jan 12 '12 21:01

m_floer


2 Answers

Very cool project, first of all.

I found a java grammar checker. I've never used it but the docs claim it can run as a server. Both java and listening to a port should be supported basically anywhere.

I'm just getting into NLP with a CS background so I wouldn't mind going into more detail to help you integrate whatever you decide on using. Feel free to ask for more detail.

like image 110
Johnny Brown Avatar answered Oct 06 '22 01:10

Johnny Brown


Another approach would be to use what is called an overgenerate and rank approach. In the first step you have your poetry generator generate multiple candidate generations. Then using a service like Amazon's Mechanical Turk to collect human judgments of fluency. I would actually suggest collecting simultaneous judgments for a number of sentences generated from the same seed conditions. Lastly, you extract features from the generated sentences (presumably using some form of syntactic parser) to train a model to rate or classify question quality. You could even thrown in the heuristics listed above.

Michael Heilman uses this approach for question generation. For more details, read these papers: Good Question! Statistical Ranking for Question Generation and Rating Computer-Generated Questions with Mechanical Turk.

like image 27
Lee Becker Avatar answered Oct 06 '22 00:10

Lee Becker