Computer AI algorithm to write sentences?

Tags:

parsing

artificial-intelligence

nlp

I am searching for information on algorithms to process text sentences or to follow a structure when creating sentences that are valid in a normal human language such as English. I would like to know if there are projects working in this field that I can go learn from or start using.

For example, if I gave a program a noun, provided it with a thesaurus (for related words) and part-of-speech (so it understood where each word belonged in a sentence) - could it create a random, valid sentence?

I'm sure there are many sub-sections of this kind of research so any leads into this would be great.

603

asked Apr 08 '11 17:04

Xeoncross

2 Answers

The field you're looking for is called natural language generation, a subfield of natural language processing: http://en.wikipedia.org/wiki/Natural_language_processing

Sentence generation is either really easy or really hard depending on how good you want the sentences to be. Currently, there aren't programs that will be able to generate 100% sensible sentences about given nouns (even with a thesaurus) -- if that is what you mean.

If, on the other hand, you would be satisfied with nonsense that is sometimes ungrammatical, then you could try an n-gram based sentence generator. These just chain together words that tend to appear in sequence, and 3- or 4-gram generators look quite okay sometimes (although you'll recognize the output as what a lot of spam email looks like).

Here's an intro to the basics of n-gram based generation, using NLTK: http://www.nltk.org/book/ch02.html#generating-random-text-with-bigrams
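As a rough illustration of the chaining idea, here is a minimal sketch in plain Python rather than NLTK. The function names `build_ngram_model` and `generate` and the toy corpus are made up for this example; a real generator would be trained on a much larger text and would assume `n` is at least 2.

```python
import random
from collections import defaultdict

def build_ngram_model(tokens, n=2):
    """Map each (n-1)-word context to the list of words observed after it (n >= 2)."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(model, seed, length=15, rng=random):
    """Extend `seed` (a tuple of n-1 words) by sampling observed continuations."""
    out = list(seed)
    k = len(seed)
    for _ in range(length):
        followers = model.get(tuple(out[-k:]))
        if not followers:  # dead end: this context was never followed by anything
            break
        # Repeated followers are kept in the list, so frequent ones are picked more often.
        out.append(rng.choice(followers))
    return " ".join(out)

# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = build_ngram_model(corpus, n=3)  # trigram model: two-word contexts
print(generate(model, seed=("the", "cat"), length=8, rng=random.Random(0)))
```

With such a tiny corpus the output mostly replays the training text; the "nonsense that looks okay" effect only shows up once many different continuations have been observed for each context.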

150

answered Sep 29 '22 21:09

silverasm

This is called NLG (Natural Language Generation), although that term mainly refers to the task of generating text that describes a set of data. There is also a lot of research on completely random sentence generation.

One starting point is to use Markov chains to generate sentences. You build a transition matrix that says how likely it is to move from each part of speech to every other part of speech. You also record which parts of speech are most likely to start and end a sentence. Put this all together and you can generate likely sequences of parts of speech.
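A minimal sketch of that idea, assuming a hand-written transition table: the probabilities, the tag names, the `START`/`END` pseudo-tags and the tiny lexicon below are all invented for illustration, not estimated from a corpus.

```python
import random

# Hypothetical transition probabilities between part-of-speech tags.
# "START" and "END" are pseudo-tags marking the sentence boundaries.
TRANSITIONS = {
    "START": {"DET": 0.7, "NOUN": 0.3},
    "DET":   {"ADJ": 0.4, "NOUN": 0.6},
    "ADJ":   {"NOUN": 0.8, "ADJ": 0.2},
    "NOUN":  {"VERB": 0.6, "END": 0.4},
    "VERB":  {"DET": 0.5, "ADV": 0.2, "END": 0.3},
    "ADV":   {"END": 1.0},
}

# Hypothetical lexicon: a few words per tag, used to fill the tag sequence.
LEXICON = {
    "DET": ["the", "a"],
    "ADJ": ["small", "green"],
    "NOUN": ["dog", "idea", "river"],
    "VERB": ["sees", "follows"],
    "ADV": ["quickly", "often"],
}

def sample_tags(rng, max_len=12):
    """Walk the chain from START until END is drawn (or max_len tags are produced)."""
    tags, current = [], "START"
    while len(tags) < max_len:
        options = TRANSITIONS[current]
        current = rng.choices(list(options), weights=list(options.values()))[0]
        if current == "END":
            break
        tags.append(current)
    return tags

def realize(tags, rng):
    """Replace each tag with a random word carrying that tag."""
    return " ".join(rng.choice(LEXICON[tag]) for tag in tags)

rng = random.Random(7)
tags = sample_tags(rng)
print(tags)
print(realize(tags, rng))
```

The tag sequence is plausible by construction, but the words are chosen independently of each other, so the result is grammatical-looking rather than meaningful.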

You are far from done at this point. First, the result will not be very good, because you are only considering the probability between adjacent parts of speech (a bigram model). To improve on this, extend the model to look, for instance, at transitions across three parts of speech (this makes a 3D matrix and gives you trigrams). You can extend it to 4-grams, 5-grams, etc., depending on your processing power and on whether your corpus is large enough to fill such a matrix.
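To make the corpus-size caveat concrete, here is a small sketch that counts transitions for a configurable context length. The helper `transition_counts`, the `START`/`END` padding and the three tagged sentences are assumptions made up for this example; with a real tagged corpus the same counting applies.

```python
from collections import Counter, defaultdict

def transition_counts(tag_sequences, order):
    """Count which tag follows each context of `order` preceding tags."""
    counts = defaultdict(Counter)
    for seq in tag_sequences:
        # Pad so that sentence starts and ends are part of the model.
        padded = ["START"] * order + list(seq) + ["END"]
        for i in range(order, len(padded)):
            context = tuple(padded[i - order:i])
            counts[context][padded[i]] += 1
    return counts

# Toy tagged "corpus", invented for illustration only.
tagged = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ["NOUN", "VERB", "DET", "NOUN"],
]

tagset = {tag for seq in tagged for tag in seq} | {"START"}
for order in (1, 2, 3):
    counts = transition_counts(tagged, order)
    possible = len(tagset) ** order  # rough upper bound on distinct contexts
    print(f"order={order}: {len(counts)} contexts observed of up to {possible}")
```

The number of possible contexts grows exponentially with the order while the observed ones grow slowly, which is why higher-order models need far more data (or some form of smoothing or back-off) before the matrix is usefully filled.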

Lastly, you need to patch up things such as agreement (subject-verb agreement, adjective-noun agreement (not in English, though), etc.) and tense, so that everything is congruent.
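As a very rough sketch of such a patch-up pass for English present-tense subject-verb agreement: the mini-lexicons and the `agree` function below are hypothetical, and the rule "the nearest noun to the left is the subject" is a deliberate simplification that breaks on many real sentences.

```python
# Hypothetical mini-lexicons: grammatical number of nouns, and verb forms per number.
NOUN_NUMBER = {"dog": "sg", "dogs": "pl", "cat": "sg", "cats": "pl"}
VERB_FORMS = {
    "run": {"sg": "runs", "pl": "run"},
    "sleep": {"sg": "sleeps", "pl": "sleep"},
}

def agree(tagged_words):
    """Inflect each verb lemma to match the most recent noun to its left.

    `tagged_words` is a list of (word, tag) pairs; verbs are given as lemmas.
    Unknown nouns default to singular, unknown verbs are left unchanged.
    """
    out, number = [], "sg"
    for word, tag in tagged_words:
        if tag == "NOUN":
            number = NOUN_NUMBER.get(word, "sg")
        elif tag == "VERB":
            word = VERB_FORMS.get(word, {}).get(number, word)
        out.append(word)
    return out

print(" ".join(agree([("the", "DET"), ("dog", "NOUN"), ("run", "VERB")])))
print(" ".join(agree([("the", "DET"), ("cats", "NOUN"), ("sleep", "VERB")])))
```

A realistic system would need a morphological lexicon or inflection library and some notion of sentence structure to find the actual subject, and tense would need a similar pass.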

answered Sep 29 '22 22:09

Gustav Larsson
