Markov chains are an (almost standard) way to generate random gibberish that looks intelligent to the untrained eye. How would you go about distinguishing Markov-generated text from human-written text?
It would be awesome if the resources you point to are Python friendly.
A Markov chain is a systematic method for generating a sequence of random variables where the current value is probabilistically dependent on the value of the prior variable. Specifically, the choice of the next variable depends only on the last variable in the chain.
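To make the definition above concrete, here is a minimal sketch of a first-order, word-level text generator. The toy corpus and the names build_chain and generate are made up for illustration; a real generator would train on a much larger text and often use two or three words of context instead of one.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain; each step looks only at the current word."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word never had a successor
            break
        # Repeated followers appear several times in the list, so a
        # uniform choice reproduces the observed transition frequencies.
        word = rng.choice(followers)
        output.append(word)
    return output


# Toy corpus (an assumption, purely for illustration).
corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
sample = generate(chain, "the", 8, random.Random(0))
print(" ".join(sample))
```

Every adjacent word pair in the output also occurs somewhere in the corpus, which is part of why this kind of text looks locally plausible.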
For NLP, a Markov chain can be used to generate a sequence of words that form a complete sentence, or a hidden Markov model can be used for named-entity recognition and tagging parts of speech. For machine learning, Markov decision processes are used in reinforcement learning to model an environment's states, actions, and rewards.
A Hidden Markov Model (HMM) is a statistical model that is also used in machine learning. It can be used to describe the evolution of observable events that depend on internal factors which are not directly observable.
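As a rough illustration of "observable events that depend on hidden factors", the sketch below scores an observation sequence with the forward algorithm. The weather/activity states and all probabilities are invented for the example; they are assumptions, not values from any real dataset.

```python
# Hidden states and hypothetical model parameters (illustrative only).
states = ("rainy", "sunny")
start_p = {"rainy": 0.6, "sunny": 0.4}
trans_p = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
emit_p = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(observations):
    """Return P(observations), summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


likelihood = forward(["walk", "shop", "clean"])
print(likelihood)
```

We only ever see the activities; the weather is never observed, yet the model still assigns a probability to the whole sequence by summing over every possible weather path.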
To determine if a Markov chain is regular, we examine its transition matrix T and its powers T^n. If we find any power n for which T^n has only positive entries (no zero entries), then we know the Markov chain is regular and is guaranteed to reach a state of equilibrium in the long run.
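The regularity test described above can be sketched in plain Python. The helper names mat_mul and is_regular and the two example matrices are assumptions for illustration. The default cutoff uses Wielandt's bound: if an n x n transition matrix has any all-positive power, then the power (n - 1)^2 + 1 is already all-positive, so there is no need to look further.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def is_regular(t, max_power=None):
    """Return True if some power of t up to max_power has only positive entries."""
    n = len(t)
    if max_power is None:
        max_power = (n - 1) ** 2 + 1  # Wielandt's bound
    power = t
    for _ in range(max_power):
        if all(entry > 0 for row in power for entry in row):
            return True
        power = mat_mul(power, t)
    return False


# Hypothetical example chains.
regular_chain = [[0.5, 0.5], [1.0, 0.0]]   # T has a zero, but T^2 does not
periodic_chain = [[0.0, 1.0], [1.0, 0.0]]  # alternates forever, never all-positive

print(is_regular(regular_chain))
print(is_regular(periodic_chain))
```

For large matrices, repeated floating-point multiplication can underflow small entries to zero, so a more careful version would track only which entries are nonzero.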
One simple approach would be to have a large group of humans read the input text for you and see if it makes sense. I'm only half-joking; this is a tricky problem.
I believe this to be a hard problem, because Markov-chain generated text is going to have a lot of the same properties as real human text in terms of word frequency and simple relationships between the ordering of words.
The differences between real text and text generated by a Markov chain are in higher-level rules of grammar and in semantic meaning, which are hard to encode programmatically. The other problem is that Markov chains are good enough at generating text that they sometimes come up with grammatically and semantically correct statements.
As an example, here's an aphorism from the kantmachine:
Today, he would feel convinced that the human will is free; to-morrow, considering the indissoluble chain of nature, he would look on freedom as a mere illusion and declare nature to be all-in-all.
While this string was written by a computer program, it's hard to say that a human would never say this.
I think that unless you can give us more specific details about the computer-generated and human-written text that expose more obvious differences, it will be difficult to solve this using computer programming.