
Inference Labeled LDA/pLDA [Topic Modelling Toolbox]

I have been trying to get inference working with a trained Labeled LDA model and with pLDA using the TMT toolbox (Stanford NLP Group). I have gone through the examples provided in the following links: http://nlp.stanford.edu/software/tmt/tmt-0.3/ http://nlp.stanford.edu/software/tmt/tmt-0.4/

Here is the code I'm trying for Labeled LDA inference:

val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7");

val model = LoadCVB0LabeledLDA(modelPath);

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(model.tokenizer.get)      //tokenize with model's tokenizer
 }

 val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer())     
 }

 val outputPath = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));

 val dataset = LabeledLDADataset(text,labels,model.termIndex,model.topicIndex);

 val perDocTopicDistributions = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset);

 val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

 TSVFile(outputPath+"-word-topic-distributions.tsv").write({
  for ((terms,(dId,dists)) <- text.iterator zip perDocTermTopicDistributions.iterator) yield {
    require(terms.id == dId);
    (terms.id,
     for ((term,dist) <- (terms.value zip dists)) yield {
       term + " " + dist.activeIterator.map({
         case (topic,prob) => model.topicIndex.get.get(topic) + ":" + prob
       }).mkString(" ");
     });
  }
});

Error

found   : scalanlp.collection.LazyIterable[(String, Array[Double])]
required: Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
       EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

I understand it's a type mismatch error, but I don't know how to resolve it in Scala. Basically, I don't understand how to extract (1) the per-document topic distribution and (2) the per-document label distribution after the output of the infer command.

Please help. The same happens in the case of pLDA: I reach the inference command, and after that I am clueless about what to do with its output.

asked Mar 23 '26 by Rohit Jain

1 Answer

Scala's type system is much more complex than Java's, and understanding it will make you a better programmer. The problem lies here:

val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

because one of model, dataset, or perDocTopicDistributions is of type:

scalanlp.collection.LazyIterable[(String, Array[Double])]

while EstimateLabeledLDAPerWordTopicDistributions.apply expects a

Iterable[(String, scalala.collection.sparse.SparseArray[Double])]

The best way to investigate these type errors is to look at the ScalaDoc (for example, the one for TMT is here: http://nlp.stanford.edu/software/tmt/tmt-0.4/api/#package ), and if you cannot easily find where the problem lies, you should make the types of your variables explicit in your code, like the following:

 val perDocTopicDistributions: LazyIterable[(String, Array[Double])] = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)

If we look together at the ScalaDoc of edu.stanford.nlp.tmt.stage:

def EstimateLabeledLDAPerWordTopicDistributions(model: edu.stanford.nlp.tmt.model.llda.LabeledLDA[_, _, _], dataset: Iterable[LabeledLDADocumentParams], perDocTopicDistributions: Iterable[(String, SparseArray[Double])]): LazyIterable[(String, Array[SparseArray[Double]])]

def InferCVB0LabeledLDADocumentTopicDistributions(model: CVB0LabeledLDA, dataset: Iterable[LabeledLDADocumentParams]): LazyIterable[(String, Array[Double])]

It should now be clear that the return value of InferCVB0LabeledLDADocumentTopicDistributions cannot be fed directly into EstimateLabeledLDAPerWordTopicDistributions.

I have never used Stanford NLP, but this is by design how the API works, so you only need to convert your scalanlp.collection.LazyIterable[(String, Array[Double])] into an Iterable[(String, scalala.collection.sparse.SparseArray[Double])] before calling the function.

If you look at the ScalaDoc for how to do this conversion, it's pretty simple. Inside the stage package, in package.scala, I can read: import scalanlp.collection.LazyIterable;

So I know where to look, and indeed at http://www.scalanlp.org/docs/core/data/#scalanlp.collection.LazyIterable there is a toIterable method which turns a LazyIterable into an Iterable. You still have to transform the inner Array into a SparseArray.

Again, I look into package.scala for the stage package inside TMT and I see: import scalala.collection.sparse.SparseArray; and I look at the scalala documentation:

http://www.scalanlp.org/docs/scalala/0.4.1-SNAPSHOT/#scalala.collection.sparse.SparseArray

It turns out that the constructors seem complicated to me, so it sounds much like I should look into the companion object for a factory method. The method I am looking for is there, and it's called apply, as usual in Scala.

def apply[T](values: T*)(implicit arg0: ClassManifest[T], arg1: DefaultArrayValue[T]): SparseArray[T]

By using this, you can write a function with the following signature:

def f: Array[Double] => SparseArray[Double]
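As a sketch of what such a function could look like: since scalala's SparseArray isn't available here, the snippet below uses a hypothetical stand-in class (FakeSparseArray) just to illustrate the dense-to-sparse step; in real code you would return a scalala SparseArray instead, e.g. via the apply factory shown above (SparseArray(values: _*)).

```scala
// Hypothetical stand-in for scalala's SparseArray: keeps only the
// non-zero entries, indexed by position, plus the logical length.
final case class FakeSparseArray(length: Int, active: Map[Int, Double])

// The conversion function with the shape Array[Double] => "sparse array".
def f(values: Array[Double]): FakeSparseArray =
  FakeSparseArray(
    values.length,
    values.zipWithIndex.collect { case (v, i) if v != 0.0 => i -> v }.toMap
  )

val dense  = Array(0.0, 0.25, 0.0, 0.75)
val sparse = f(dense)
// sparse keeps only indices 1 and 3, the non-zero positions
```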

Once this is done, you can turn the result of InferCVB0LabeledLDADocumentTopicDistributions into a non-lazy iterable of sparse arrays with one line of code:

result.toIterable.map { case (name, values) => (name, f(values)) }
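To illustrate that last step with plain Scala collections (a sketch only: Scala's Iterator stands in for scalanlp's LazyIterable, and the hypothetical fSparse stands in for f, returning a plain Map instead of a real SparseArray):

```scala
// Stand-in conversion: a Map[Int, Double] plays the role of SparseArray[Double].
def fSparse(values: Array[Double]): Map[Int, Double] =
  values.zipWithIndex.collect { case (v, i) if v != 0.0 => i -> v }.toMap

// Iterator plays the role of the lazy per-document distributions.
val result: Iterator[(String, Array[Double])] =
  Iterator("doc1" -> Array(0.0, 1.0), "doc2" -> Array(0.5, 0.0))

// Force the lazy sequence and sparsify each dense array, mirroring
// result.toIterable.map { case (name, values) => (name, f(values)) }
val converted: List[(String, Map[Int, Double])] =
  result.toList.map { case (name, values) => (name, fSparse(values)) }
```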
answered Mar 24 '26 by Edmondo1984


