Computing Pointwise Mutual Information in Spark

Tags:

apache-spark

apache-spark-mllib

I'm trying to compute pointwise mutual information (PMI).

pmi(x; y) = log( p(x, y) / (p(x) p(y)) )

I have two RDDs as defined here for p(x, y) and p(x) respectively:

pii: RDD[((String, String), Double)]
pi: RDD[(String, Double)]

Any code I'm writing to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten the RDD pii and join with pi twice while massaging the tuple elements.

val pmi = pii.map(x => (x._1._1, (x._1._2, x._1, x._2)))
             .join(pi).values
             .map(x => (x._1._1, (x._1._2, x._1._3, x._2)))
             .join(pi).values
             .map(x => (x._1._1, computePMI(x._1._2, x._1._3, x._2)))
// pmi: org.apache.spark.rdd.RDD[((String, String), Double)]
...
def computePMI(pab: Double, pa: Double, pb: Double) = {
  // handle boundary conditions, etc
  log(pab) - log(pa) - log(pb)
}

Clearly, this sucks. Is there a better (idiomatic) way to do this? Note: I could optimize the logs by storing the log-probs in pi and pii, but I'm writing it this way to keep the question clear.

825

asked Apr 14 '15 06:04

Delip

1 Answer

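If pi is small enough to be collected to the driver and held in memory on every executor, you can avoid both joins by broadcasting it as a map and looking the marginals up inside a single map over pii.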

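// Collect pi to the driver as a map and ship one read-only copy to each executor.
val bcPi = pi.context.broadcast(pi.collectAsMap())
val pmi = pii.map {
  case ((x, y), pxy) =>
    // Map.apply throws NoSuchElementException if x or y is missing from pi.
    (x, y) -> computePMI(pxy, bcPi.value(x), bcPi.value(y))
}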

This assumes pi contains every x and y that appears in pii; otherwise the lookup throws a NoSuchElementException.
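If that assumption might not hold, one option is to skip pairs whose marginals are missing instead of failing. The following is only a sketch under a few assumptions of mine: the names BroadcastPmi and pmiSkippingMissing are hypothetical, computePMI is the question's formula with no boundary handling, and the local-mode SparkContext and sample data in main exist purely to make the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.math.log

object BroadcastPmi {
  // Same formula as computePMI in the question; boundary handling is omitted here.
  def computePMI(pab: Double, pa: Double, pb: Double): Double =
    log(pab) - log(pa) - log(pb)

  // Hypothetical helper: drops any pair whose x or y has no entry in pi,
  // rather than throwing on the missing key.
  def pmiSkippingMissing(
      pii: RDD[((String, String), Double)],
      pi: RDD[(String, Double)]): RDD[((String, String), Double)] = {
    val bcPi = pii.context.broadcast(pi.collectAsMap())
    pii.flatMap { case ((x, y), pxy) =>
      val result = for {
        px <- bcPi.value.get(x) // None if x is absent from pi
        py <- bcPi.value.get(y) // None if y is absent from pi
      } yield (x, y) -> computePMI(pxy, px, py)
      result.toList // empty list when either marginal is missing
    }
  }

  def main(args: Array[String]): Unit = {
    // Local master and made-up probabilities, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("pmi-sketch"))
    try {
      val pi = sc.parallelize(Seq("a" -> 0.5, "b" -> 0.5))
      val pii = sc.parallelize(Seq(("a", "b") -> 0.25, ("a", "c") -> 0.1))
      // ("a", "c") is dropped because "c" has no entry in pi.
      pmiSkippingMissing(pii, pi).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Silently dropping pairs can hide data problems, so whether to skip, log, or fail on a missing marginal depends on how pi and pii were produced. The same memory caveat applies as above: pi.collectAsMap() has to fit on the driver and on each executor.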

128

answered Sep 30 '22 10:09

emeth
