Accurate text generation

I have a chat app that works with predefined messages. The database has about 80 predefined conversations, each with 5 possible responses. To clarify, here's an example:

Q: "How heavy is a polar bear?"

R1: "Very heavy?"
R2: "Heavy enough to break the ice."
R3: "I don't know. Silly question."
R4: ...
R5: ...

Let's say the user chooses R3: "I don't know. Silly question."

That response will in turn have 5 possible responses, e.g.:

R1: "Why is that silly?"
R2: "You're silly!"
R3: "Ugh. I'm done talking to you now."
R4: ...
R5: ...

And each of those responses will have 5 possible responses; after which, the conversation will end and a new one will have to be started.

So to recap, I have 80 manually written conversations, each with 5 possible responses, going 3 layers deep: 80 × 5 × 5 × 5 = 10,000 messages in the final layer (12,400 if the first two layers are counted as well).
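
The arithmetic behind that recap can be checked with a few lines of Python. This is only a counting sketch; the constant names are made up for illustration and nothing here reflects how the app actually stores its data.

```python
# Assumed figures from the description: 80 conversations,
# 5 responses per message, 3 layers of responses.
CONVERSATIONS = 80
BRANCHING = 5
DEPTH = 3

# Number of response messages in each layer (layer 1, 2, 3).
per_layer = [CONVERSATIONS * BRANCHING ** level for level in range(1, DEPTH + 1)]

# Messages in the last layer, i.e. distinct complete conversation paths.
leaf_messages = per_layer[-1]

# Response messages summed over all layers.
total_responses = sum(per_layer)

print(per_layer, leaf_messages, total_responses)
```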

My question: What would be the most accurate way of automatically generating more conversations such as these using machine learning?

I looked into RNNs, starting with Karpathy's RNN post. Although an RNN can generate new content based on the old, the new content is quite random and nonsensical.

For a better understanding of how these conversations are used, please visit http://getvene.com/ and watch the preview video.

Ilya Karnaukhov asked Jun 13 '17 14:06


2 Answers

I would probably start with a generative text model. There is a nice article that uses Python and Keras (you can, however, also build an LSTM recurrent neural network with TensorFlow). With a good, rich set of training data, the algorithm can indeed produce pretty interesting text. As that article mentions, Project Gutenberg offers an impressive number of free books, which should provide a sufficient amount of training data. However, since you have probably already played with RNNs, I will move on.
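
For illustration, here is a minimal sketch of the data preparation step that character-level generative models usually need: slicing the corpus into fixed-length input windows, each paired with the character that follows it. It is plain Python and stops before any model is built; the function name, the window size, and the sample corpus are my own assumptions, not something taken from the article.

```python
def make_training_windows(text, window=10, step=1):
    """Slice text into (input sequence, next character) pairs encoded as integer ids."""
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}
    inputs, targets = [], []
    for start in range(0, len(text) - window, step):
        seq = text[start:start + window]   # window of consecutive characters
        nxt = text[start + window]         # the character the model should predict
        inputs.append([char_to_id[c] for c in seq])
        targets.append(char_to_id[nxt])
    return inputs, targets, char_to_id


# Hypothetical toy corpus; a real one would be far larger.
corpus = "how heavy is a polar bear? heavy enough to break the ice."
X, y, vocab = make_training_windows(corpus, window=10)
print(len(X), "training windows,", len(vocab), "distinct characters")
```

The resulting `X` and `y` lists are what you would then convert to arrays and feed to an LSTM in whichever framework you pick.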

The next thing is the relation between a question and its possible responses. This tells me that some semantics are involved in your conversations: they are not random, and generated responses should at least try to fit the question. Something like Latent Dirichlet Allocation finds categories and topics from data, but you need it the other way around: given a topic (the question), find data (responses) that is at least somewhat relevant. Perhaps you could split the generated text into many parts, vectorize those parts, and use something like a document distance algorithm to find the closest match? Latent Semantic Analysis could also come in handy, because in effect you need to reduce a matrix of words/vectors as much as you can while still preserving the similarities.
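
To make the vectorize-and-compare idea concrete, here is a rough sketch that scores generated candidates against a question using TF-IDF weights and cosine similarity. It is plain Python with no libraries; the function names and sample sentences are invented for the example, and it only measures word overlap, so it stands in for the "document distance" step rather than for LDA or LSA.

```python
import math
from collections import Counter

PUNCT = ".,!?\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(question, candidates):
    """Sort candidates by similarity to the question, best first."""
    vectors = tfidf_vectors([question] + candidates)
    scored = [(cosine_similarity(vectors[0], v), c)
              for v, c in zip(vectors[1:], candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_candidates(
    "How heavy is a polar bear?",
    ["Heavy enough to break the ice.",
     "The stock market fell today.",
     "A polar bear is heavier than you."],
)
for score, text in ranked:
    print(round(score, 3), text)
```

A candidate that shares no words with the question scores 0.0 and could be discarded; anything smarter than word overlap would need the topic-model or LSA step mentioned above.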

Ivan Sivak answered Nov 19 '22 21:11


I recommend using PPDB (http://www.cis.upenn.edu/~ccb/ppdb/) to paraphrase your sentences and so expand your training data. The paper at https://www.aclweb.org/anthology/P/P16/P16-2.pdf#page=177 is one example; you can use a similar approach to rephrase each sentence.
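
As a rough illustration of the idea, the sketch below expands one sentence by swapping in paraphrases from a lookup table. The tiny `PARAPHRASES` dict is hand-written and purely hypothetical; it stands in for entries you would load from PPDB and does not reflect PPDB's actual file format or any API.

```python
import re

# Hypothetical stand-in for paraphrase pairs loaded from a resource such as PPDB.
PARAPHRASES = {
    "i don't know": ["i have no idea", "i'm not sure"],
    "silly": ["foolish", "ridiculous"],
}


def paraphrase_variants(sentence, table):
    """Return new sentences made by replacing one known phrase at a time."""
    variants = set()
    for phrase, alternatives in table.items():
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        if not pattern.search(sentence):
            continue
        for alt in alternatives:
            def swap(match, alt=alt):
                # Keep the capitalization of the phrase being replaced.
                if match.group(0)[0].isupper():
                    return alt[0].upper() + alt[1:]
                return alt
            variants.add(pattern.sub(swap, sentence))
    return sorted(variants)


variants = paraphrase_variants("I don't know. Silly question.", PARAPHRASES)
for v in variants:
    print(v)
```

Each variant can be added to the training set alongside the original response; in practice you would want to filter the paraphrase pairs by their confidence so that odd substitutions do not slip in.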

Ramtin M. Seraj answered Nov 19 '22 21:11