 

NLP/Quest. Answering - Retrieving information from DB

I've been doing a bit of reading up on NLP recently, and so far I've got a (very) basic idea of how everything works, ranging from sentence splitting to POS-tagging, and also knowledge representation.

I understand that there's a wide diversity of NLP libraries out there (mostly in Java or Python) and have found a .NET implementation (SharpNLP). It's been excellent actually. No need to write any custom processing logic; just use their functions and voila! user input is well-separated and POS-tagged.

What I don't understand is where to go from here, if my main motivation is to build a Question Answering system (something like a chatterbot). What libraries (preferably .NET) are available for me to use? If I wish to construct my own KB, how should I represent my knowledge? Do I need to parse the POS-tagged input into something else that my DB can understand? And if I'm using MS SQL, is there any library that helps map POS-tagged input to database queries? Or do I need to write my own database querying logic, according to procedural semantics (I've read)?

The next step, of course, is to formulate a well-constructed reply, but I think I can leave that for later. Right now what is bugging me is the lack of resources in this area (knowledge representation, NLP to KB/DB-retrieval), and I'd really appreciate it if anyone of you there could offer me your expertise :)

asked Jan 15 '23 by matt

1 Answer

This is a very broad question, and as such it barely fits the StackOverflow format; nevertheless, I'd like to give it a stab.

First, a word on NLP
The broad availability of mature tools in the area of NLP is in itself somewhat misleading. Certainly most NLP functions, from, say, POS-tagging or chunking to automatic summarization or named-entity recognition, are covered and generally well served by the logic and supporting data of the various libraries. However, building real-world solutions from these building blocks is hardly a trivial task. One needs to:

  • architect a solution along some sort of pipeline or chain whereby the results of a particular transformation feed into the input of subsequent processes.
  • configure the individual processes: their computational frameworks are well established, but they are extremely sensitive to underlying data such as the training/reference corpus, optional tuning parameters, etc.
  • select and validate the proper functions/processes.
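The pipeline idea above can be sketched in a few lines. This is a toy illustration with made-up stage names, not any particular library's API; it only shows the shape of "each transformation feeds the next":

```python
# A toy NLP pipeline: each stage transforms the output of the previous one.
# Stage names and behavior are illustrative, not any real library's API.

def split_sentences(text):
    # Naive sentence splitter; real tools handle abbreviations, quotes, etc.
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    return sentence.split()

def run_pipeline(text, stages):
    data = text
    for stage in stages:
        data = stage(data)
    return data

# Compose the pipeline: text -> sentences -> list of token lists.
pipeline = [split_sentences, lambda sents: [tokenize(s) for s in sents]]
result = run_pipeline("John wrote a book. Cows produce milk.", pipeline)
```

In a real system each stage (tokenizer, tagger, chunker, NER...) would be a configured component from a library, but the composition problem stays the same.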

The above is particularly difficult for the parts of the solution associated with extracting and handling semantic elements from the text (Information Extraction at large, but also co-reference resolution, relationship extraction, or sentiment analysis, to name a few). These NLP functions, and their implementations in the various libraries, tend to be harder to configure and more sensitive to domain-dependent patterns, to variations in register, or even to the "format" of the supporting corpora.

In a nutshell, NLP libraries provide essential building blocks for applications such as the "Question Answering system" mentioned in the question, but much "glue" is required, along with much discretion as to how and where to apply it (and a good dose of non-NLP technology, such as the knowledge representation discussed below).

On knowledge representation
As hinted above, POS-tagging alone isn't a sufficient element of the NLP pipeline. Essentially, POS-tagging adds information about each word in the text, indicating its [likely] grammatical role (Noun vs. Adjective vs. Verb vs. Pronoun, etc.). This POS information is quite useful because it allows, for example, subsequent chunking of the text into logically related groups of words, and/or a more precise lookup of individual words in dictionaries, taxonomies, or ontologies.
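To make that concrete, here is a deliberately tiny sketch of tagging and chunking. The lexicon and tag names are invented for the example; real taggers are statistical models trained on large corpora, but they produce output of the same shape:

```python
# Toy POS tagger backed by a tiny hand-built lexicon, plus a chunker that
# groups DET/ADJ/NOUN runs into noun phrases. Purely illustrative.
LEXICON = {
    "the": "DET", "black": "ADJ", "cat": "NOUN",
    "eats": "VERB", "a": "DET", "mouse": "NOUN",
}

def pos_tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "UNK")) for tok in tokens]

def chunk_noun_phrases(tagged):
    # Collect runs of DET/ADJ tokens; close the chunk at each NOUN.
    chunks, current = [], []
    for tok, tag in tagged:
        if tag in ("DET", "ADJ", "NOUN"):
            current.append(tok)
            if tag == "NOUN":
                chunks.append(" ".join(current))
                current = []
        else:
            current = []
    return chunks

tagged = pos_tag("the black cat eats a mouse".split())
phrases = chunk_noun_phrases(tagged)  # ["the black cat", "a mouse"]
```

The point is that the tags, not the raw words, are what the chunker keys on; this is exactly the "POS info enables further steps" idea.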

To illustrate the kind of information extraction, and the underlying knowledge representation, that may be required for some "Question Answering system", I'll discuss a common format used in various Semantic Search engines. Beware, however, that this format is more conceptual than prescriptive for semantic search, and that other applications, such as expert systems or machine translation, require yet other forms of knowledge representation.

The idea is to use NLP techniques along with supporting data (from plain "lookup tables" for simple lexicons, to tree-like structures for taxonomies, to ontologies expressed in specialized languages) to extract triplets of entities from the text, with the following structure:

  • an agent: something or somebody "doing" something
  • a verb : what is being done
  • an object : a person or item upon which the "doing" is done (or more generically, some complement of information about the "doing")

Examples:
  cat/Agent eat/Verb mouse/Object
  John-Grisham/Agent write/Verb The-Pelican-Brief/Object
  cows/Agent produce/Verb milk/Object
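In code, such triples map naturally onto a small record type. A minimal sketch (the `Fact` name and `answers` helper are mine, purely for illustration):

```python
# One way to represent agent/verb/object triples ("facts") in code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    agent: str
    verb: str
    obj: str   # named "obj" to avoid shadowing the Python builtin "object"

facts = [
    Fact("cat", "eat", "mouse"),
    Fact("John-Grisham", "write", "The-Pelican-Brief"),
    Fact("cows", "produce", "milk"),
]

def answers(facts, verb, obj):
    # "Who/what <verb>s <obj>?" -> the matching agents
    return [f.agent for f in facts if f.verb == verb and f.obj == obj]
```

A question like "Who wrote The Pelican Brief?" then reduces to a lookup with the verb and object fixed and the agent as the unknown.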

Furthermore, such triplets, sometimes called "facts", can be categorized into various types corresponding to specific semantic patterns, typically organized around the semantics of the verb. For example, "Cause-Effect" facts have a verb which expresses some causality, "Contains" facts have a verb which implies a container-to-containee relationship, "Definition" facts follow patterns where the agent/subject is defined [if only partially] by the object (e.g. "cats are mammals"), etc.

One can easily imagine how such a database of facts can be queried to supply answers to questions, and also to provide various smarts and services, such as synonym substitution or improved relevance of answers (compared with plain keyword matching).
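Since the question mentions MS SQL: a fact store fits an ordinary relational table, and answering a question becomes a plain SQL query. The sketch below uses Python's built-in sqlite3 so it is self-contained; the schema and query are my illustration, but they translate directly to MS SQL:

```python
# A facts table, and a question answered as a SQL query.
# sqlite3 is used only to keep the demo self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        agent TEXT, verb TEXT, object TEXT, fact_type TEXT
    )""")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [("cat", "eat", "mouse", "generic"),
     ("John-Grisham", "write", "The-Pelican-Brief", "authorship"),
     ("cats", "be", "mammals", "definition")],
)

# "Who wrote The Pelican Brief?" -> fix verb and object, ask for the agent.
rows = conn.execute(
    "SELECT agent FROM facts WHERE verb = ? AND object = ?",
    ("write", "The-Pelican-Brief"),
).fetchall()
```

The hard part, as discussed next, is not the query but producing the normalized rows ("write" rather than "wrote", "John-Grisham" rather than "He") from free text.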

The real difficulty is in extracting the facts from the text. Many NLP functions are put into play for that purpose. For example, one of the steps in the NLP pipeline is to replace pronouns with the noun they reference (anaphora resolution, or more generally co-reference resolution, in NLP lingo). Another step is to identify named entities: names of people, geographic places, books, etc. (NER in NLP lingo). Yet another step may be to rewrite clauses joined by "AND" so as to create facts by repeating the grammatical elements that are implied.
For example, maybe the John Grisham example above came from a text excerpt like
Author J. Grisham was born in Arkansas. He wrote "A Time to Kill" in 1989 and "The Pelican Brief" in 1992.

Getting to John-Grisham/Agent write/Verb The-Pelican-Brief/Object implies (among other things):

  • identifying "J. Grisham" and "The Pelican Brief" as specific entities.
  • replacing "He" by "John-Grisham" in the 2nd sentence.
  • rewriting the 2nd sentence as two facts: "John-Grisham wrote A-Time-to-Kill in 1989" and "John-Grisham wrote The-Pelican-Brief in 1992"
  • dropping the "in 1992" part (or better yet, creating another fact, a "Time" fact: "The-Pelican-Brief/Agent is-related-in-time/Verb year-1992/Object"; btw this would also imply having identified 1992 as a time entity of type "year").
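Two of the steps above, pronoun substitution and splitting the "and"-joined clause, can be sketched as string rewrites. Real co-reference resolution is far more involved (it has to *choose* the antecedent); this toy version is only meant to show the shape of the transformation:

```python
# Toy versions of two rewriting steps: replacing a pronoun with its
# antecedent, and splitting an "and"-joined clause into two full clauses.

def resolve_pronoun(sentence, antecedent, pronoun="He"):
    # Naive: assumes the pronoun's antecedent is already known.
    return sentence.replace(pronoun, antecedent, 1)

def split_conjunction(sentence, subject_verb):
    # Rewrite 'X wrote "A" in 1989 and "B" in 1992' into two clauses
    # by repeating the implied subject and verb.
    parts = sentence.split(" and ")
    return [parts[0]] + [f"{subject_verb} {p}" for p in parts[1:]]

s = 'He wrote "A Time to Kill" in 1989 and "The Pelican Brief" in 1992'
s = resolve_pronoun(s, "John-Grisham")
clauses = split_conjunction(s, "John-Grisham wrote")
```

Each resulting clause is then a candidate for triple extraction (agent, verb, object, plus the time complement).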

In a nutshell: information extraction is a complicated task, even when applied to relatively limited domains and when leveraging the existing NLP functions available in a library. It is certainly a much "messier" activity than merely telling the nouns from the adjectives and the verbs ;-)

answered Jan 21 '23 by mjv