
Algorithms to identify Markov generated content?

Markov chains are an (almost standard) way to generate random gibberish that looks intelligent to the untrained eye. How would you go about distinguishing Markov-generated text from human-written text?

It would be awesome if the resources you point to are Python friendly.

asked Jul 26 '09 by agiliq


People also ask

What is Markov chain algorithm?

A Markov chain is a systematic method for generating a sequence of random variables in which the current value depends probabilistically on the value of the prior variable. Specifically, the choice of the next variable depends only on the last variable in the chain.
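For illustration, here is a minimal word-level Markov chain text generator in Python; the tiny corpus and the order-1 (bigram) model are assumptions chosen only to keep the sketch short:

```python
# A minimal sketch of a word-level Markov chain text generator.
# The corpus and the order-1 model are illustrative assumptions.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Build transition lists: word -> words observed to follow it.
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

def generate(start, length=8):
    """Walk the chain: each next word depends only on the current word."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = transitions.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the cat slept on the mat and the"
```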

What is Markov model in NLP?

For NLP, a Markov chain can be used to generate a sequence of words that form a complete sentence, or a hidden Markov model can be used for named-entity recognition and tagging parts of speech. For machine learning, Markov decision processes are used to represent reward in reinforcement learning.

What are hidden Markov models and where this model is used in AI?

A Hidden Markov Model (HMM) is a statistical model that is also used in machine learning. It describes the evolution of observable events that depend on internal factors which are not directly observable.
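As a small sketch of that idea, here is a toy HMM in Python where hidden states (weather) drive observable events (activities); all of the probabilities and labels are assumptions made up for the example:

```python
# Toy HMM: hidden states are not observed, only the emitted activities are.
states = ["Rainy", "Sunny"]               # hidden states
start_p = {"Rainy": 0.6, "Sunny": 0.4}    # initial state probabilities
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def likelihood(obs_seq):
    """Forward algorithm: probability of seeing the observation sequence."""
    forward = {s: start_p[s] * emit_p[s][obs_seq[0]] for s in states}
    for obs in obs_seq[1:]:
        forward = {s: sum(forward[prev] * trans_p[prev][s] for prev in states)
                      * emit_p[s][obs]
                   for s in states}
    return sum(forward.values())

print(likelihood(["walk", "shop", "clean"]))  # ~0.0336
```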

How can you tell if a chain is Markov?

To determine if a Markov chain is regular, we examine its transition matrix T and its powers T^n. If we find any power n for which T^n has only positive entries (no zero entries), then we know the Markov chain is regular and is guaranteed to reach a state of equilibrium in the long run.
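A short NumPy sketch of that check (the example transition matrix is an assumption chosen for illustration):

```python
# Check regularity: does some power of T have strictly positive entries?
import numpy as np

T = np.array([[0.0, 1.0],
              [0.5, 0.5]])  # rows sum to 1

def is_regular(T, max_power=50):
    """Return True if some power T^n has no zero entries."""
    P = np.eye(len(T))
    for _ in range(max_power):
        P = P @ T
        if np.all(P > 0):
            return True
    return False

print(is_regular(T))  # True: T^2 already has only positive entries
```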


1 Answer

One simple approach would be to have a large group of humans read the input text for you and see if it makes sense. I'm only half-joking; this is a tricky problem.

I believe this to be a hard problem, because Markov-chain-generated text will have many of the same properties as real human text in terms of word frequencies and the simple, local relationships between the ordering of words.

The differences between real text and text generated by a Markov chain are in higher-level rules of grammar and in semantic meaning, which are hard to encode programmatically. The other problem is that Markov chains are good enough at generating text that they sometimes come up with grammatically and semantically correct statements.
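To make that point concrete, here is a rough sketch of the kind of local statistic an order-1 Markov generator tends to preserve (bigram patterns) versus a longer-range one it tends to drift on (trigram patterns). This is only an illustration of the idea, not a method the answer endorses, and the example texts are made up:

```python
# Compare how many of a candidate text's n-grams also occur in a reference
# text. A low-order Markov generator trained on the reference will tend to
# match bigrams well but drift on trigrams; human text behaves differently.
def ngrams(words, n):
    return list(zip(*(words[i:] for i in range(n))))

def overlap(candidate, reference, n):
    """Fraction of the candidate's n-grams that appear in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = set(ngrams(reference.split(), n))
    return sum(g in ref for g in cand) / max(len(cand), 1)

reference = "the human will is free and nature is a chain of causes"
candidate = "the human will is a chain of causes and nature is free"

print(overlap(candidate, reference, 2))  # bigram overlap (high, ~0.91)
print(overlap(candidate, reference, 3))  # trigram overlap (lower, ~0.60)
```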

As an example, here's an aphorism from the kantmachine:

Today, he would feel convinced that the human will is free; to-morrow, considering the indissoluble chain of nature, he would look on freedom as a mere illusion and declare nature to be all-in-all.

While this string was written by a computer program, it's hard to say that a human would never say this.

I think that unless you can give us more specific details about the computer- and human-generated text that expose more obvious differences, it will be difficult to solve this with a program.

answered Sep 30 '22 by James Thompson