I am working on a project where I would like to achieve a sense of natural language understanding. However, I am going to start small and would like to train it on specific queries.
So for example, starting out I might tell it:
songs.
Then if it sees a sentence like "Kanye Wests songs" it can match against that.
But then I would like to give it some extra sentences that could mean the same thing, so that it eventually learns to map unknown sentences onto the set I have trained it on.
So I might add the sentence: "Songs by
And of course would be a database of names it can match against.
I came across a neat website, Wit.ai, that does something like what I describe. However, it resolves its matches to an intent, whereas I would like to match to a simplified query or, better, a database-like query (like Facebook Graph Search).
I understand a context-free grammar would work well for this (anything else?). But what are good methods for training on several CFGs that I say have similar meaning, so that when the system sees unknown sentences it can try to predict which one they correspond to?
Any thoughts would be great.
Basically, I would like to be able to take a natural language sentence and convert it to some form that can be better understood by my system and presented to the user in a nice way. Not sure if there is a better Stack Exchange site for this!
To begin with, I think SO is quite well-suited for this question (I checked Area 51, there is no stackexchange for NLP).
Under the assumption that you are already familiar with the usual training of PCFG grammars, I am going to move into some specifics that might help you achieve your goal:
Any grammar trained on a corpus is going to be dependent on the words in that training corpus. The poor performance on unknown words is a well-known issue in not just PCFG training, but in pretty much any probabilistic learning framework. What we can do, however, is to look at the problem as a paraphrasing issue. After all, you want to group together sentences that have the same meaning, right?
Recent research on detecting sentences or phrases that have the same (or similar) meaning has employed a technique known as distributional similarity. It aims at improving probability estimation for unseen co-occurrences. The basic concept is:
words or phrases that share the same distribution—the same set of words in the same context in a corpus—tend to have similar meanings.
You can use only intrinsic features (e.g. production rules in a PCFG) or bolster such features with additional semantic knowledge (e.g. ontologies like Freebase). Using additional semantic knowledge enables generation of more complex sentences/phrases with similar meanings, but such methods usually work well only for specific domains. So, if you want your system to work well only for music, adding such knowledge is a good idea.
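To make the distributional idea concrete, here is a minimal sketch in Python. It is not the algorithm from the survey linked in this answer; it only illustrates the basic concept by building a context vector for each word (the neighbouring words and their relative positions) and comparing vectors with cosine similarity. The toy corpus, the `<name>` token standing in for an artist name, and the function names `context_vectors` and `cosine` are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for this sketch. "<name>" stands in for any artist name.
CORPUS = [
    "play songs by <name>",
    "play tracks by <name>",
    "list songs by <name>",
    "list tracks by <name>",
    "show albums by <name>",
    "play songs from <name>",
]


def context_vectors(sentences, window=1):
    """Map each word to a Counter of (relative position, neighbour) pairs."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][(j - i, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    vecs = context_vectors(CORPUS)
    # Words that occur in the same contexts score higher than words that do not.
    for left, right in [("songs", "tracks"), ("songs", "albums"), ("songs", "by")]:
        print(left, right, round(cosine(vecs[left], vecs[right]), 3))
```

On this toy data, "songs" and "tracks" come out as more similar to each other than "songs" and "by", because they appear in the same slots. A real system would need a much larger corpus and phrase-level (not just word-level) contexts.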
Reproducing the actual distributional similarity algorithms will make this answer insanely long, so here's a link to an excellent article:
Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods by Madnani and Dorr.
For your work, you will only need to go through section 3.2: Paraphrasing Using a Single Monolingual Corpus. I believe the algorithm labeled as 'Algorithm 1' in this paper will be useful to you. I am not aware of any publicly available tool/code that does this, however.