I need to generate paraphrase of an english sentence using the PPDB paraphrase database
I have downloaded the datasets from the website.
I would say your first step needs to be reducing the problem into more manageable components. Second figure out whether you want to paraphrase on a one-to-one, lexical, syntactical, phrase or combination basis. To inform this decision I would take one sentence and paraphrase it myself in order to get an idea of what I am looking for. Next I would start writing a parser for the downloaded data. Then I would remove the stopwords and incorporate a part-of-speech tagger like the ones included in spaCy or nltk for your example phrase.
Since they seem to give you all the information needed to make a successive dictionary filter that is where I would start. I would write a filter which found the parts of speech for each word in my sentence in the [LHS] column of the dataset and select a source that matches the word while minimizing/maximizing the value of 1 feature (like minimizing WordLenDiff) which in the case of "businessnow" <- "business now" = -1.5. Keeping track of the target feature you will then have a basic paraphrased sentence.
using this strategy your output could turn:
"the business uses 4 gb standard."
sent_score = 0
into:
"businessnow uses 4gb standard"
sent_score = -3
After you have a basic example the you can start exploring feature selection algorithms in like those in scikit-learn, etc. and incorporate word alignment. But I would seriously cut down on the scope of the problem and increase it gradually. In the end, how you approach the problem it depends on what the designated use is and how functional it needs to be.
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With