I want to perform GeoIP lookups on my data in Spark. To do that I'm using MaxMind's GeoIP database.
What I want to do is initialize a GeoIP database object once on each partition, and later use it to look up the city associated with an IP address.
Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? E.g. something like (this is Python, but I want a Scala solution):
class IPLookup(object):
    database = None

    def getCity(self, ip):
        # lazily initialise the database on first use
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...
Of course, doing this requires that Spark serialise the whole object, something the docs caution against.
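For what it's worth, the Scala pattern I have in mind is a lazy val in a singleton object, which (as far as I understand) is initialised at most once per executor JVM, on first access, and is never serialised with a closure. A rough sketch, assuming the MaxMind GeoIP2 Java client (the database path is a placeholder):

import java.io.File
import java.net.InetAddress
import com.maxmind.geoip2.DatabaseReader

object IPLookup {
  // Lives in a singleton object, so it is never captured in a closure;
  // lazy, so each executor JVM builds it once, on first use.
  lazy val reader: DatabaseReader =
    new DatabaseReader.Builder(new File("/path/to/GeoLite2-City.mmdb")).build()

  // Note: city() throws for unknown/private addresses; wrap in Try if needed.
  def getCity(ip: String): String =
    reader.city(InetAddress.getByName(ip)).getCity.getName
}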
In Spark, per-partition operations can be done using:
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)
This mapper will execute the function f once per partition, over an iterator of the partition's elements. The idea is that the cost of setting up resources (like DB connections) is amortised over the number of elements in the iterator.
Example:
val logsRDD = ???
logsRDD.mapPartitions { iter =>
  val geoIp = new GeoIPLookupDB(...)
  // this is a local map over the iterator - do not confuse with rdd.map
  iter.map(elem => (geoIp.resolve(elem.ip), elem))
}
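One caveat with this pattern: iter.map is lazy, so the function returns before any lookups actually run; if the per-partition resource needs explicit cleanup, close it only after the iterator has been consumed. Applied to the MaxMind case, a sketch using the GeoIP2 Java client (LogEntry, the RDD, and the database path are placeholders, not part of the question):

import java.io.File
import java.net.InetAddress
import scala.util.Try
import com.maxmind.geoip2.DatabaseReader

case class LogEntry(ip: String, line: String)   // placeholder element type
val logsRDD: org.apache.spark.rdd.RDD[LogEntry] = ???

val resolved = logsRDD.mapPartitions { iter =>
  // One reader per partition, built on the executor, so nothing is serialised.
  val reader = new DatabaseReader.Builder(new File("/path/to/GeoLite2-City.mmdb")).build()
  iter.map { entry =>
    // city lookups throw for unknown/private addresses, hence the Try
    val city = Try(reader.city(InetAddress.getByName(entry.ip)).getCity.getName)
      .getOrElse("unknown")
    (city, entry)
  }
}

This builds one reader per partition rather than one per element, which is the whole point of mapPartitions; with many small partitions, the singleton/lazy val approach from the question avoids even that per-partition cost.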
This seems like a good usage of a broadcast variable. Have you looked at the documentation for that functionality, and if so, does it fail to meet your requirements in some way?
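For illustration, one way to combine a broadcast variable with the MaxMind reader is to broadcast the raw .mmdb bytes once and rebuild the reader on each executor, since (as far as I know) DatabaseReader itself is not serialisable but its Builder also accepts an InputStream. A sketch, again assuming the GeoIP2 Java client, a SparkContext in sc, and the logsRDD from the earlier example (the path is a placeholder):

import java.io.ByteArrayInputStream
import java.net.InetAddress
import java.nio.file.{Files, Paths}
import com.maxmind.geoip2.DatabaseReader

// Read the database once on the driver and broadcast the bytes to every executor.
val dbBytes = sc.broadcast(Files.readAllBytes(Paths.get("/path/to/GeoLite2-City.mmdb")))

val cities = logsRDD.mapPartitions { iter =>
  // Rebuild the reader from the broadcast bytes: the bytes ship once per executor
  // rather than once per task, and no non-serialisable object crosses the wire.
  val reader = new DatabaseReader.Builder(new ByteArrayInputStream(dbBytes.value)).build()
  iter.map(elem => (reader.city(InetAddress.getByName(elem.ip)).getCity.getName, elem))
}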