
What is distant supervision?

According to my understanding, distant supervision is the process of identifying the concept that the individual words of a passage, usually a sentence, are trying to convey.

For example, suppose a database stores the structured relation concerns(NLP, this sentence).

Our distant supervision system would take as input the sentence: "This is a sentence about NLP."

Since the sentence would already have been passed through a named-entity recognizer as a pre-processing step, the system would recognize the entities "NLP" and "this sentence".

Since our database records that NLP and this sentence are related by concerns, the system would identify the input sentence as expressing the relation concerns(NLP, this sentence).

My question is twofold:

1) What is the use of that? Is it that our system might later see a sentence "in the wild", such as "That sentence is about OPP", recognize that it has seen something similar before, and thereby infer the novel relation concerns(OPP, that sentence), based only on the words (individual tokens)?

2) Does it take into account the actual words of the sentence? For instance, would it look at the verb 'is' and the preposition 'about' and recognize (through WordNet or some other hyponymy resource) that they are somehow similar to the higher-order concept "concerns"?

Does anyone have some code for building a distant supervision system that I could look at, i.e. a system that cross-references a KB, such as Freebase, with a corpus, such as the NYTimes, and produces a distantly supervised dataset? I think that would go a long way toward clarifying my conception of distant supervision.

smatthewenglish asked Apr 11 '15 08:04




2 Answers

RE 1) Yes, this is exactly right. In the end, what we want is a classifier that takes as input a piece of text and a pair of entity mentions in that text, and tells us what relation holds between those entities in that sentence. Distant supervision is a way of mocking up the training data for that classifier, with the supervision coming "at a distance" from a known knowledge base rather than from hand labels. But the end goal is the same as in most machine learning tasks: generalize to new sentences.
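Since the question asks for code, here is a minimal sketch of the labeling step, not a real Freebase/NYTimes pipeline. The toy `KB`, the toy `CORPUS`, and the function name `distant_label` are all made up for illustration, and plain substring matching stands in for a real named-entity recognizer and entity linker.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (entity1, entity2) -> relation.
KB: Dict[Tuple[str, str], str] = {
    ("Barack Obama", "Hawaii"): "born_in",
    ("Google", "Larry Page"): "founded_by",
}

# Hypothetical toy corpus standing in for something like news text.
CORPUS: List[str] = [
    "Barack Obama was born in Hawaii in 1961.",
    "Barack Obama visited Hawaii last week.",
    "Google was started by Larry Page and Sergey Brin.",
    "Larry Page gave a talk in Hawaii.",
]


def distant_label(
    corpus: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both entities of a KB pair.

    Assumption: substring matching replaces a real NER + linking step.
    Returns (sentence, entity1, entity2, relation) tuples.
    """
    examples = []
    for sentence in corpus:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:
                examples.append((sentence, e1, e2, relation))
    return examples


EXAMPLES = distant_label(CORPUS, KB)
for example in EXAMPLES:
    print(example)
```

Three of the four sentences get a label. Note that "Barack Obama visited Hawaii last week." is tagged born_in even though it does not express that relation; that is exactly the kind of noise distant supervision accepts in exchange for not hand-labeling anything.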

RE 2) Certainly! Distant supervision only applies to how the training data is generated [1]. Once you've assumed distant supervision, what you're left with is a corpus of (sentence, relation_for_sentence) pairs, and then you extract all of the usual NLP features on the sentence.
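To make "the usual NLP features" a bit more concrete, here is a rough sketch of one common kind of feature: the words that appear between the two entity mentions. The function name `extract_features` and the feature names are my own invention for this example; a real system would add things like part-of-speech tags and dependency paths.

```python
from typing import List


def extract_features(sentence: str, e1: str, e2: str) -> List[str]:
    """Return simple lexical features for an entity pair in a sentence.

    Assumption: entities are located by their first substring match.
    """
    start1, start2 = sentence.find(e1), sentence.find(e2)
    if start1 == -1 or start2 == -1:
        return []  # one of the entities is not mentioned

    # Take the text between the two mentions, whichever comes first.
    if start1 < start2:
        between, order = sentence[start1 + len(e1):start2], "e1_first"
    else:
        between, order = sentence[start2 + len(e2):start1], "e2_first"

    # Lowercase tokens with surrounding punctuation stripped.
    tokens = [t.strip(".,").lower() for t in between.split()]
    tokens = [t for t in tokens if t]

    features = [f"order={order}"]
    features += [f"between_word={t}" for t in tokens]
    features.append("between_seq=" + "_".join(tokens))
    return features


FEATS = extract_features(
    "Barack Obama was born in Hawaii in 1961.", "Barack Obama", "Hawaii"
)
print(FEATS)
```

So the actual words do matter: features such as between_seq=was_born_in are what let the trained classifier recognize the same relation between entity pairs the knowledge base has never seen.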

[1] To a first approximation -- there are "distantly supervised" models (like MultiR and MIML-RE) which don't directly generate fake training data, but incorporate the supervision indirectly into the training procedure itself. But, even in these, there is a factor in the latent-variable model that amounts to a per-sentence classification, and it's just that the output variable is latent rather than naively "observed" as in vanilla distant supervision.
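As a rough illustration of the footnote, the sketch below shows only a data layout in which the knowledge-base label is attached to an entity pair as a whole while each sentence's own label is left unknown. It is not the training procedure of MultiR or MIML-RE; `build_bags`, the dictionary keys, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical distantly labeled examples: (sentence, e1, e2, relation).
LABELED: List[Tuple[str, str, str, str]] = [
    ("Barack Obama was born in Hawaii in 1961.", "Barack Obama", "Hawaii", "born_in"),
    ("Barack Obama visited Hawaii last week.", "Barack Obama", "Hawaii", "born_in"),
    ("Google was started by Larry Page and Sergey Brin.", "Google", "Larry Page", "founded_by"),
]


def build_bags(labeled: List[Tuple[str, str, str, str]]) -> Dict[Tuple[str, str], dict]:
    """Group mentions by entity pair; keep the KB relation at the bag level."""
    bags: Dict[Tuple[str, str], dict] = {}
    for sentence, e1, e2, relation in labeled:
        bag = bags.setdefault((e1, e2), {"relations": set(), "mentions": []})
        bag["relations"].add(relation)
        # The per-sentence label is latent: a model would infer it in training.
        bag["mentions"].append({"sentence": sentence, "latent_label": None})
    return bags


BAGS = build_bags(LABELED)
for pair, bag in BAGS.items():
    print(pair, sorted(bag["relations"]), len(bag["mentions"]))
```

With this layout, the two Obama/Hawaii sentences share one born_in label, and a latent-variable model is free to decide that only one of them actually expresses it.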

Gabor Angeli answered Sep 25 '22 18:09


According to my understanding now, the real value of distant supervision is that we can use it to annotate a big corpus without having to manually consider each sentence, which is very expensive in terms of person-hours. In the end some of the relations recognized in sentences will be false, but the result will hopefully be "pretty good", which is useful in some applications, such as academics competing with each other to get marginally better scores on this silly task, and other things such as... (examples are welcome).

smatthewenglish answered Sep 22 '22 18:09